Running TensorFlow on Apache YARN –
A sneak peek into GPU Scheduling
Sunil Govindan
Apache Hadoop PMC member
YARN Team @ Hortonworks
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
• Overview of Machine Learning on a Big Data Platform
• GPU support in Apache Hadoop YARN
• TensorFlow on YARN – example and demo
Overview:
Machine Learning on Big Data Platform
Machine learning workflow
[Diagram: workflow stages, in order]
Data Preprocessing → Feature Engineering → Model Training → Online Service
Boxes shown: Data; Feature Selection; Feature Transform; Feature Encoding; Feature Evaluation; Model Training; Model Evaluation; Model Validation; Model Staging; Experiment; Online Feature; Model Database; Model as Service; Real-time Feature; Calibration
Machine learning (BigData) – Data Preprocessing
• Import data
  – HDFS
  – AWS
  – RDBMS
• Join data
• Data exploration
• Data sample
• Training/Test random split
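The last preprocessing step, a random training/test split, can be sketched in a few lines. This is a minimal illustration in plain Python, not the tooling used in the talk; the function name, the 80/20 ratio and the fixed seed are assumptions chosen for the example.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    shuffled = list(rows)              # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))                 # stand-in for imported records
train, test = train_test_split(rows, test_fraction=0.2)
print(len(train), len(test))
```

On a real cluster the same idea would normally run inside the data engine (for example a DataFrame split) rather than on an in-memory list.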
Machine learning (BigData) – Feature Engineering
• Feature transform/selection
• Feature embedding
Machine learning (BigData) – Model Training
• Traditional machine learning models
  – Logistic Regression
  – Gradient boosting tree
  – Recommendation/ALS
  – LDA
• Libraries
  – Apache Spark MLlib
  – XGBoost
• Deep learning models
  – DNN
  – CNN
  – RNN
  – LSTM
• Libraries
  – TensorFlow
  – Apache MXNet
  – BigDL
Machine learning (BigData) – Model Serving
• Model deploy
• Model serving
  – Batch
  – Streaming
• Experiment
  – offline
  – online (A/B test)
GPU support in Apache Hadoop YARN
Machine learning platform on YARN
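[Diagram: machine learning platform layers]
Resources: CPU, GPU, SSD
YARN: Data Operating System (Cluster Resource Management)
Engines on YARN: Spark MLlib, XGBoost, Hive/LLAP, Spark SQL, TensorFlow
Notebook: Zeppelin
Data sources: HDFS, AWS S3, RDBMS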
Why GPU?
• GPUs can speed up the following computation-intensive applications by 10x to 300x:
  – Gene Analysis
  – Deep Learning
  – Self-Driving Car
  – Scientific Computation
• Without GPU speedup, these computations are almost impossible in practice (a job could run for weeks).
Why GPU?
• GPU: many cores to handle massive (but simple) computation tasks simultaneously.
                  GPU                      CPU
Workload          Computation-intensive    Other
Model             Nvidia Tesla K40         Intel Xeon E5-2697
Cores             2880 CUDA cores          14 cores
Price             $2200.00                 $2295.00
Price per core    about $0.76              about $164
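The per-core figures on this slide are plain division, which the short check below reproduces from the list prices and core counts quoted above. It is only arithmetic: a CUDA core and a CPU core are not equivalent units of compute, so the ratio is a rough indication and not a benchmark.

```python
def price_per_core(price_usd, cores):
    """Return the list price divided by the number of cores."""
    return price_usd / cores


# Figures taken from the slide: Tesla K40 and Xeon E5-2697.
gpu_per_core = price_per_core(2200.00, 2880)
cpu_per_core = price_per_core(2295.00, 14)

print(f"GPU: ${gpu_per_core:.2f} per core")
print(f"CPU: ${cpu_per_core:.2f} per core")
print(f"CPU core costs about {cpu_per_core / gpu_per_core:.0f}x a CUDA core")
```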
Why all under YARN
A normal YARN user asks for:
• SLA!
• Monitoring!
• Quotas!
• Isolation!
What YARN offers:
• Capacity Planning, Preemption, Reservation System.
• Timeline services, Grafana, etc.
• CPU / Memory, (WIP) GPU, FPGA, Network
All running on the same YARN platform
[Diagram: LLAP daemons on 128 G nodes and GPU nodes, all in one YARN cluster]
Current status of GPU support on YARN
• Using node labels (YARN-796), available since Apache Hadoop 2.6.0
  – Use node labels to partition one big cluster into smaller disjoint clusters, and assign shares/ACLs to queues.
  – Issues: 1) GPU is not a countable resource in scheduling. 2) No proper isolation for GPUs.
• The rest of GPU support is work in progress; umbrella JIRA: YARN-6223
GPU support: Challenges
• GPU isolation
  – Unlike memory / CPU, computation has affinity to a specific GPU device.
  – Multiple processes using a single GPU are serialized (MPS is an exception).
  – Multiple processes sharing the same GPU can easily cause OOM.
    • TF provides options to use less GPU memory than the whole device offers, but we cannot enforce this from outside the process.
GPU support: Challenges
• Hierarchy of GPUs matters:
  – GPU topology strongly affects communication latency (Von Neumann bottleneck).
Picture credit: https://opus.nci.org.au
GPU support: Challenges
• GPU on Docker: "build once, run anywhere" is not simple.
  – A regular app can run on CentOS 6/7, or any other host, as long as the CPU architecture is the same.
  – A GPU application, however, needs a driver to talk to the hardware.
[Diagram: an Nginx app container (Ubuntu 14.04) runs on the host OS; a TensorFlow 1.2 container (Ubuntu 14.04, CUDA Library 5.0, GPU Base Lib v2) fails (X) on a host OS with GPU Base Lib v1]
GPU Support: Solutions
• GPU isolation:
  – With the general resource types feature:
    • Detect and report the number of GPUs to the YARN scheduler, and the scheduler makes the central decision.
  – For normal processes: use the cgroups devices subsystem (the same mechanism as CPU/memory isolation).
  – For Docker processes: pass --device on the docker command line when launching the container.
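For the Docker case, the idea is that only the GPUs the scheduler allocated are exposed to the container. The sketch below builds such a `docker run` argument list. It is an illustration, not YARN's actual launcher code: the helper name, the image name `my-tf-image` and the choice of control devices (`/dev/nvidiactl`, `/dev/nvidia-uvm`) are assumptions for the example.

```python
def docker_gpu_args(allocated_gpus):
    """Build --device flags exposing only the allocated NVIDIA GPUs."""
    # Control devices commonly needed alongside the per-GPU nodes (assumed).
    devices = ["/dev/nvidiactl", "/dev/nvidia-uvm"]
    # One device node per allocated GPU index, e.g. /dev/nvidia1.
    devices += [f"/dev/nvidia{i}" for i in sorted(allocated_gpus)]
    args = []
    for dev in devices:
        args += ["--device", f"{dev}:{dev}"]   # host path : container path
    return args


# Hypothetical allocation: the scheduler granted GPUs 1 and 3 on this node.
gpu_args = docker_gpu_args([3, 1])
cmd = ["docker", "run", "--rm"] + gpu_args + ["my-tf-image", "python", "train.py"]
print(" ".join(cmd))
```

GPUs 0 and 2 are simply never passed, so the container cannot open them; that is the isolation property the slide describes.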
GPU Support: Solutions
• GPU on Docker support
  – By using nvidia-docker-plugin.
[Diagram: the container (Ubuntu 14.04, TensorFlow 1.2, CUDA Library 5.0) gets the host's GPU Base Lib v1 through a volume mount]
How the rest of YARN helps GPU support
• Node partition
  – Without node partitions, we cannot guarantee the best GPU utilization. Consider an example:
  – Two hosts in the cluster; only Host1 has GPUs. At the beginning, the cluster is empty.
  – At time T1, a user submits a Spark job that needs 10G memory and 4 CPUs. Without node partitions, it could be placed on Host1.
  – If another job then needs 15G memory, 6 CPUs and 3 GPUs, it cannot be allocated.
[Diagram: available resources before and after Task1]
                    Host1 (GPU)              Host2
Initially           Mem 20G, CPU 8, GPU 4    Mem 20G, CPU 8
After Task1         Mem 10G, CPU 4, GPU 4    Mem 20G, CPU 8
Next job: ?
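The example on this slide can be replayed with a toy first-fit scheduler. This is a simplified model for illustration, not YARN's scheduling logic; the function names and the `allowed` parameter (standing in for a node partition) are assumptions.

```python
def fits(free, need):
    """True if every requested resource is available on the host."""
    return all(free.get(k, 0) >= v for k, v in need.items())


def place(hosts, need, allowed=None):
    """First-fit placement; returns the chosen host name or None."""
    for name, free in hosts.items():
        if allowed is not None and name not in allowed:
            continue                      # host is outside the job's partition
        if fits(free, need):
            for k, v in need.items():
                free[k] -= v              # reserve the resources
            return name
    return None


def new_cluster():
    return {
        "host1": {"mem_gb": 20, "cpu": 8, "gpu": 4},   # the GPU host
        "host2": {"mem_gb": 20, "cpu": 8, "gpu": 0},
    }


spark_job = {"mem_gb": 10, "cpu": 4}
gpu_job = {"mem_gb": 15, "cpu": 6, "gpu": 3}

# Without partitions: the Spark job may land on host1 and block the GPU job.
hosts = new_cluster()
without_partition = (place(hosts, spark_job), place(hosts, gpu_job))

# With partitions: non-GPU work is confined to host2, GPU work to host1.
hosts = new_cluster()
with_partition = (
    place(hosts, spark_job, allowed={"host2"}),
    place(hosts, gpu_job, allowed={"host1"}),
)

print(without_partition)
print(with_partition)
```

In the first run the GPU job gets no host even though three GPUs are idle, because host1 has only 10G of memory left and host2 has no GPUs.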
How the rest of YARN helps GPU support
• Resource Profiles
  – A generalized resource vector
  – Admins can create custom Resource Types!
  – Profiles make resources easier to request
[Diagram: NodeManagers report Memory, CPU and GPU to the ResourceManager, which holds the Small, Medium and Large profiles; the Application Master requests "Small"]
Profile   Memory   CPU        GPU
Small     2 GB     4 cores    1 GPU
Medium    4 GB     8 cores    1 GPU
Large     16 GB    16 cores   4 GPUs
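The profile table above maps a short name to a full resource vector. A minimal sketch of that lookup follows; the values come from the slide, while the key names (`memory-mb`, `vcores`, `yarn.io/gpu`) and the `resolve` helper are assumptions used for illustration and may not match what a given Hadoop release expects.

```python
# Profile name -> resource vector, using the values from the slide's table.
PROFILES = {
    "small":  {"memory-mb": 2 * 1024,  "vcores": 4,  "yarn.io/gpu": 1},
    "medium": {"memory-mb": 4 * 1024,  "vcores": 8,  "yarn.io/gpu": 1},
    "large":  {"memory-mb": 16 * 1024, "vcores": 16, "yarn.io/gpu": 4},
}


def resolve(profile, overrides=None):
    """Expand a profile name into a resource vector, applying any overrides."""
    key = profile.lower()
    if key not in PROFILES:
        raise ValueError(f"unknown resource profile: {profile}")
    vector = dict(PROFILES[key])       # copy so the shared table is not mutated
    vector.update(overrides or {})
    return vector


# The Application Master asks for "Small" instead of spelling out each resource.
small = resolve("Small")
large_two_gpus = resolve("large", {"yarn.io/gpu": 2})
print(small)
print(large_two_gpus)
```

Because the vector is a plain mapping, an admin-defined resource type is just one more key, which is the "generalized vector" point on the slide.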
Current development status (YARN-6223)
• Apache Hadoop 3.1.0 release (Jan 15, 2018)
  – GPU auto detection (Merged)
  – GPU scheduling in RM (Merged)
  – GPU isolation using cgroups (Merged)
  – GPU on Docker isolation & volume (Merged)
  – UI / Metrics (Merged)
  – Documentation (Open)
  – Ambari changes (Open)
TensorFlow on Apache Hadoop YARN
YARN assembly: Makes everything easier!
• Forget about writing an application master; this is how you can run an app on YARN:
• Write the assembly spec in JSON (we call it a Yarnfile).
• Post the JSON as a REST request to the YARN server.
• YARN figures out the rest.
• An example:
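The slide's own example did not survive in this transcript, so what follows is a hypothetical sketch of the two steps listed above: write a spec as JSON, then post it to the ResourceManager. The field names and the `/app/v1/services` path follow the YARN services REST API as published in later Hadoop 3.x releases and are assumptions here; the host name, service name and launch command are made up.

```python
import json
import urllib.request

# Hypothetical Yarnfile: one TensorFlow worker asking for CPU, memory and a GPU.
yarnfile = {
    "name": "tensorflow-demo",
    "version": "1.0.0",
    "components": [
        {
            "name": "worker",
            "number_of_containers": 1,
            "launch_command": "python train.py",
            "resource": {
                "cpus": 4,
                "memory": "8192",
                # Extra resource types such as GPUs (assumed field layout).
                "additional": {"yarn.io/gpu": {"value": 1}},
            },
        }
    ],
}


def build_request(rm_url, spec):
    """Build (but do not send) the POST request that submits the spec."""
    body = json.dumps(spec).encode("utf-8")
    return urllib.request.Request(
        url=rm_url.rstrip("/") + "/app/v1/services",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request("http://rm.example.com:8088", yarnfile)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would submit it; omitted so the sketch runs offline.
```

No application master code appears anywhere in the sketch, which is the point of the slide: the spec describes what to run and YARN handles the rest.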
Demo….
Questions?

[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan

  • 1. Running Tensorflow on Apache YARN – A sneak peak into GPU Scheduling Sunil Govindan Apache Hadoop PMC member YARN Team @ Hortonworks
  • 2. Agenda
      • Overview of Machine Learning on Big Data Platform
      • GPU support in Apache Hadoop YARN
      • TensorFlow on YARN – example and demo
      © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 3. Overview: Machine Learning on Big Data Platform
  • 4. Machine learning workflow
      (Diagram. Stages: Data Preprocessing, Feature Engineering, Model Training, Online Service. Box labels: Data, Feature Selection, Feature Transform, Feature Encoding, Feature Evaluation, Feature, Model Training, Model Evaluation, Model Validation, Model Staging, Experiment, Online Feature, Model Database, Model as Service, Real-time Feature Calibration.)
  • 5. Machine learning (BigData) – Data Preprocessing
      (Diagram labels: Data, Feature Selection, Feature Transform, Feature Encoding, Feature Evaluation, Feature Engineering.)
      • Import data
          – HDFS
          – AWS
          – RDBMS
      • Join data
      • Data exploration
      • Data sample
      • Training/Test random split
  • 6. Machine learning (BigData) – Feature Engineering
      (Diagram labels: Data, Feature Selection, Feature Transform, Feature Encoding, Feature Evaluation, Feature Engineering.)
      • Feature transform/selection
      • Feature embedding
  • 7. Machine learning (BigData) – Model Training
      (Diagram labels: Feature, Model Training, Model Evaluation, Model Validation, Model Staging.)
      • Traditional machine learning models
          – Logistic Regression
          – Gradient boosting tree
          – Recommendation/ALS
          – LDA
      • Libraries
          – Apache Spark MLlib
          – XGBoost
      • Deep learning models
          – DNN
          – CNN
          – RNN
          – LSTM
      • Libraries
          – TensorFlow
          – Apache MXNet
          – BigDL
  • 8. Machine learning (BigData) – Model Serving
      (Diagram labels: Experiment, Online Feature, Model Database, Model as Service, Real-time Feature Calibration, Online Service.)
      • Model deploy
      • Model serving
          – Batch
          – Streaming
      • Experiment
          – offline
          – online (A/B test)
  • 9. GPU support in Apache Hadoop YARN
  • 10. Machine learning platform on YARN
      (Diagram labels: CPU, GPU, SSD; YARN: Data Operating System (Cluster Resource Management); Spark MLlib, XGBoost, Hive/LLAP, Spark SQL, TensorFlow, Zeppelin; HDFS, AWS S3, RDBMS.)
  • 11. Why GPU?
      • GPUs can speed up the following computation-intensive applications by 10x to 300x: Gene Analysis, Deep Learning, Self-Driving Cars, Scientific Computation.
      • Without GPU speedup, these computations are almost impossible in practice (a job may run for weeks).
  • 12. Why GPU?
      • GPU: many cores to handle massive (but simple) computation tasks simultaneously.
      (Diagram labels: GPU, CPU, Computation Intensive, Other.)
      • Nvidia Tesla K40: 2880 CUDA cores, $2200.00 => $0.76 / core
      • Intel Xeon E5-2697: 14 cores, $2295.00 => about $164 / core
  • 13. Why all under YARN
      • A normal YARN user: SLA! Monitoring! Quotas! Isolation!
      • Capacity Planning, Preemption, Reservation System.
      • Timeline services, Grafana, etc.
      • CPU / Memory, (WIP) GPU, FPGA, Network
  • 14. All running on the same YARN platform
      (Diagram labels: LLAP (three times), 128 G (seven times), GPUs.)
  • 15. Current status of GPU support on YARN
      • Using node labels (YARN-796), available since Apache Hadoop 2.6.0
          – Use node labels to partition one big cluster into smaller disjoint clusters, and assign shares/ACLs to queues.
          – Issues: 1) GPU is not a countable resource in scheduling. 2) No proper isolation for GPUs.
      • The rest of GPU support is WIP; umbrella JIRA: YARN-6223
  • 16. GPU support: Challenges
      • GPU isolation
          – Unlike memory / CPU, computation has affinity to a specific GPU device.
          – Multiple processes using a single GPU are serialized (MPS is an exception).
          – Multiple processes sharing the same GPU can easily cause OOM.
              • TF provides options to use less GPU memory than the whole device offers, but this cannot be enforced externally.
  • 17. GPU support: Challenges
      • Hierarchy of GPUs matters:
          – GPU topology really matters: it affects communication latency a lot! (Von Neumann bottleneck)
      Picture credit: https://opus.nci.org.au
  • 18. GPU support: Challenges
      • GPU on Docker: "build once, run anywhere" is not simple:
          – A regular app can run on CentOS 6/7 or any other host, as long as the CPU architecture is the same.
          – However, a GPU application needs a driver to talk to the hardware.
      (Diagram labels: Nginx App; Ubuntu 14.04; TensorFlow 1.2; CUDA Library 5.0; GPU Base Lib v2; Host OS; GPU Base Lib v1; X Fails.)
  • 19. GPU Support: Solutions
      • GPU isolation:
          – With the general resource types feature:
              • Detect and report the number of GPUs to the YARN scheduler, and the scheduler makes a central decision.
          – For normal processes: use the cgroups devices subsystem (the same isolation mechanism as for CPU/memory).
          – For Docker processes: pass the --device command-line option when launching the Docker container.
  • 20. GPU Support: Solutions
      • GPU on Docker support
          – By using nvidia-docker-plugin.
      (Diagram labels: TensorFlow 1.2; Nginx App; Ubuntu 14.04; CUDA Library 5.0; Host OS; GPU Base Lib v1; Volume Mount.)
  • 21. How the rest of YARN helps GPU support
      • Node partition
          – Without node partitions, the best GPU utilization cannot be guaranteed. Let's look at an example:
          – There are two hosts in the cluster, and only Host1 has GPUs. At the beginning, the cluster is empty.
          – At time T1, a user submits a Spark job that needs 10G memory and 4 CPUs. Without node partitions, it could be placed on Host1.
          – If another job then needs 15G memory, 6 CPUs and 3 GPUs, it cannot be allocated.
      (Diagram: initially Host1 (GPU) has 20G Mem, 8 CPU, 4 GPU and Host2 has 20G Mem, 8 CPU. After Task1 is placed on Host1, Host1 has 10G Mem, 4 CPU, 4 GPU left and Host2 is unchanged.)
  • 22. How the rest of YARN helps GPU support
      • Resource Profiles
          – A generalized vector
          – Admins can create custom Resource Types!
          – Easier resource requests using profiles
      (Diagram labels: two NodeManagers, each with Memory, CPU, GPU; ResourceManager with Small, Medium, Large; Application Master requesting Small.)

      Profile   Memory   CPU        GPU
      Small     2 GB     4 Cores    1 Cores
      Medium    4 GB     8 Cores    1 Cores
      Large     16 GB    16 Cores   4 Cores
  • 23. Current development status (YARN-6223)
      • Apache Hadoop 3.1.0 release (Jan 15, 2018)
          – GPU auto detection (Merged)
          – GPU scheduling in RM (Merged)
          – GPU isolation using cgroups (Merged)
          – GPU on Docker isolation & volume (Merged)
          – UI / Metrics (Merged)
          – Documentation (Open)
          – Ambari changes (Open)
  • 24. TensorFlow on Apache Hadoop YARN
  • 25. YARN assembly: Makes everything easier!
      • Forget about writing an application master; this is how you can run an app on YARN:
      • Write the assembly spec in JSON (we call it a Yarnfile).
      • Post the JSON as a REST request to the YARN server.
      • YARN figures out the rest.
      • An example:
  • 26. Demo
  • 27. Questions?
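
The node-partition example from slide 21 can be replayed in a few lines of Python. This is only an illustrative sketch: the `Resources` class, the `place` function and the first-fit placement rule are hypothetical stand-ins and not how the YARN scheduler is implemented. The host sizes and job requests are the numbers given on the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resources:
    mem_gb: int
    cpus: int
    gpus: int = 0

    def fits(self, ask: "Resources") -> bool:
        # A request fits only if every dimension is available.
        return (ask.mem_gb <= self.mem_gb and ask.cpus <= self.cpus
                and ask.gpus <= self.gpus)

    def take(self, ask: "Resources") -> None:
        self.mem_gb -= ask.mem_gb
        self.cpus -= ask.cpus
        self.gpus -= ask.gpus


def place(hosts, ask, allowed=None) -> Optional[str]:
    """First-fit placement (an assumption); `allowed` models a node partition."""
    for name, free in hosts.items():
        if allowed is not None and name not in allowed:
            continue
        if free.fits(ask):
            free.take(ask)
            return name
    return None


def run(use_partition: bool):
    # Cluster from the slide: only Host1 has GPUs.
    hosts = {"Host1": Resources(20, 8, 4), "Host2": Resources(20, 8, 0)}
    spark_job = Resources(10, 4)    # 10G memory, 4 CPUs, no GPU
    gpu_job = Resources(15, 6, 3)   # 15G memory, 6 CPUs, 3 GPUs
    # With a partition, non-GPU jobs are kept off the GPU host.
    spark_allowed = {"Host2"} if use_partition else None
    gpu_allowed = {"Host1"} if use_partition else None
    return (place(hosts, spark_job, spark_allowed),
            place(hosts, gpu_job, gpu_allowed))


print(run(use_partition=False))
print(run(use_partition=True))
```

Without the partition, the Spark job lands on Host1 and the GPU job cannot be placed anywhere: Host1 no longer has enough memory and Host2 has no GPUs. With the partition, the Spark job goes to Host2 and the GPU job fits on Host1.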

Editor's Notes

  1. This is a typical machine learning workflow, which involves three steps: feature engineering, model training and online service. Not surprisingly, the most important thing is to have the right features: those capturing historical information dominate other types of features. Once we have the right features and the right model, other factors play small roles. We first derive a feature representation from raw data, then feed these features into a machine learning model, and then evaluate the models and push the best one into online service. The workflow is complicated and usually involves several steps supported by several infrastructure components.
  2. The machine learning workflow starts with loading data from different data sources, such as HDFS, AWS S3 or a database system. After that, we usually join data from different sources to generate a wide table. Apache Hive and Apache Spark are the most appropriate tools for this workload. Data scientists then start data exploration via Zeppelin. The most common issue is unbalanced labels in the dataset: for example, far more positive labels than negative ones. To get a more accurate model, we subsample data from the group that has more instances to make the dataset balanced. After that, we randomly split the dataset into training and test sets with the help of Spark. Once we have the training data, we can start feature engineering.
  3. Feature engineering has made great progress over the past decade, from hand-designed features to automated feature discovery by deep learning. In many cases, hand-designed features can leverage domain knowledge, which leads to optimal results; Spark MLlib provides many feature transform/selection operators to make this simple. But it involves heavy manual work and requires hiring experienced engineers. In recent years, more and more scientists and engineers have applied deep neural networks (DNNs) to computer vision, speech recognition and natural language processing, with good results. A DNN can learn features automatically via embedding; the most famous embedding technique is word2vec, which produces a vector space in which each unique word in the corpus is assigned a corresponding vector.
  4. Model training is the most important step of the whole pipeline.
  5. Deploy the model in a distributed fashion for parallel model serving in batch mode or streaming mode. Evaluate the model offline or online using different metrics.
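
The balancing and random-split steps described in note 2 can be sketched as follows. The notes do this with Spark; this sketch swaps in plain Python with the standard `random` module so it is self-contained. The function names, the 90/10 toy dataset and the 20% test fraction are all assumptions made for illustration.

```python
import random


def balance_by_downsampling(rows, label_of, seed=0):
    """Subsample every label group down to the size of the smallest group."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows, then cut off the first part as the test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy dataset (hypothetical): 90 positive rows and 10 negative rows.
rows = [(i, 1) for i in range(90)] + [(i, 0) for i in range(90, 100)]
balanced = balance_by_downsampling(rows, label_of=lambda r: r[1])
train, test = train_test_split(balanced, test_fraction=0.2)
print(len(balanced), len(train), len(test))
```

Downsampling discards data from the larger group, which is acceptable when that group is plentiful; fixing the seed keeps the sample and the split reproducible between runs.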