TensorFlow on Apache Hadoop YARN
Sunil Govindan
Apache Hadoop PMC member
Hortonworks
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
 Machine Learning on Big Data Platform
 TensorFlow on Apache Hadoop YARN
 Example walkthrough with demo
-- Related Session --
Apache YARN - Past, Present and Future
Sunil G, Hortonworks
Rohith Sharma K S, Hortonworks
https://dataworkssummit.com/sydney-2017/sessions/yarn-past-present-future/
Wednesday September 20th, Room C4.5, 5.10 PM
Machine Learning on Big Data Platform
Apache Zeppelin
 Data Scientists: explore data, create pipeline, find best params, save model
 Software engineers: load model, deploy in production, scoring on batch/streaming data
Machine learning workflow
[Figure: end-to-end workflow in four stages]
 Data Preprocessing and Feature Engineering: Data, Feature Selection, Feature Transform, Feature Encoding, Feature Evaluation
 Model Training: Model Training, Feature, Model Evaluation, Model Validation, Model Staging
 Online Service: Experiment, Online Feature, Model Database, Experiment, Model as Service, Real-time Feature, Calibration
Machine Learning in a Unified Platform
“Hidden Technical Debt in Machine Learning Systems”, Google
Machine learning – Data Preprocessing
[Figure: workflow stage boxes: Data, Feature Selection, Feature Transform, Feature Encoding, Feature Evaluation (Feature Engineering)]
 Import data
– HDFS
– AWS
– RDBMS
 Join data
 Data exploration
 Data sample
 Training/Test random split
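The last step, a random training/test split, can be sketched in a few lines. This is a minimal illustration in plain Python, not the tooling used in the talk; the function name, the 80/20 ratio, and the fixed seed are assumptions chosen for the example.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(range(100), test_fraction=0.2)
print(len(train), len(test))
```

On a real cluster the same idea would normally be expressed with the data-processing engine's own split primitive rather than an in-memory list.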
Machine learning – Feature Engineering
[Figure: workflow stage boxes: Data, Feature Selection, Feature Transform, Feature Encoding, Feature Evaluation (Feature Engineering)]
 Feature transform/selection
 Feature embedding
Machine learning – Model Training
[Figure: workflow stage boxes: Model Training, Feature, Model Evaluation, Model Validation, Model Staging (Model Training)]
 Traditional machine learning models
– Logistic Regression
– Gradient boosting tree
– Recommendation/ALS
– LDA
 Libraries
– Apache Spark MLlib
– XGBoost
 Deep learning models
– DNN
– CNN
– RNN
– LSTM
 Libraries
– TensorFlow
– Apache MXNet
– BigDL
Model Training - Deep learning doesn’t fit every problem
 Natural language processing
 Computer vision
 Speech/Video
 Anti-fraud
 Recommendation
 CTR estimation
 Topic model
 PageRank
Machine learning – Model Serving
[Figure: workflow stage boxes: Experiment, Online Feature, Model Database, Experiment, Model as Service, Real-time Feature, Calibration (Online Service)]
 Model deploy
 Model serving
– Batch
– Streaming
 Experiment
– offline
– online (A/B test)
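For the online (A/B test) experiment, one common way to split traffic is deterministic hashing of the user id. The sketch below is an assumption-laden illustration, not something shown in the talk: the function name, the experiment key, and the 50/50 share are all hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, treatment_share=0.5):
    """Deterministically map a user to 'control' or 'treatment'."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # The first 32 bits of the hash give a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


print(assign_variant("user-17", "ctr-model-v2"))
```

Because the assignment depends only on the experiment name and the user id, the same user sees the same model on every request without any shared state.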
TensorFlow on Apache Hadoop YARN
Apache Hadoop YARN Overview
 YARN is Apache Hadoop’s resource management framework
 YARN is made up of the following components:
– ResourceManager (RM): the master service for YARN; allocates resources to applications
– Per-node NodeManager (NM): launches application containers and monitors their resource usage
– Per-application ApplicationMaster (AM): requests containers for the application
 At its core, YARN is responsible for managing “containers” across a collection of servers
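The division of labor between the three components can be illustrated with a toy model. This is not the YARN API: the class names mirror the roles above, but the methods, the first-fit placement, and the node sizes are assumptions made only for this sketch (real YARN schedulers are considerably more involved).

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy per-node agent: tracks free resources and launched containers."""
    name: str
    free_memory_mb: int
    free_vcores: int
    containers: list = field(default_factory=list)

    def launch(self, container_id, memory_mb, vcores):
        self.free_memory_mb -= memory_mb
        self.free_vcores -= vcores
        self.containers.append(container_id)


class ResourceManager:
    """Toy master service: decides which node hosts each container."""

    def __init__(self, nodes):
        self.nodes = nodes
        self._next_id = 0

    def allocate(self, memory_mb, vcores):
        """First-fit placement; returns (container_id, node_name) or None."""
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                self._next_id += 1
                container_id = f"container_{self._next_id}"
                node.launch(container_id, memory_mb, vcores)
                return container_id, node.name
        return None


def application_master(rm, num_containers, memory_mb, vcores):
    """Toy per-application master: requests containers, keeps the grants."""
    granted = []
    for _ in range(num_containers):
        allocation = rm.allocate(memory_mb, vcores)
        if allocation is not None:
            granted.append(allocation)
    return granted


rm = ResourceManager([NodeManager("nm1", 4096, 4), NodeManager("nm2", 4096, 4)])
granted = application_master(rm, num_containers=3, memory_mb=2048, vcores=2)
print(granted)
```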
Machine learning platform on YARN
[Figure: platform stack]
 Hardware resources: CPU, GPU, SSD
 YARN: Data Operating System (Cluster Resource Management)
 Frameworks on YARN: Spark MLlib, XGBoost, Hive/LLAP, Spark SQL, TensorFlow
 Notebook: Zeppelin
 Data sources: HDFS, AWS S3, RDBMS
Why all under YARN
• Isolation: CPU / Memory; (WIP) GPU, FPGA, Network
• Governance: Queues / user quotas, user access control
• Monitoring: Timeline Service, Grafana, etc.
• Better SLA: Capacity Planning, Preemption, Reservation System
All running on the same YARN platform
[Figure: cluster nodes labeled 128 G, several running LLAP, alongside nodes with GPUs]
Recent work in YARN to support ML workloads like TensorFlow
 GPU Support - Boost workload performance with GPUs
 Resource Profiles and support for Custom Resource Types - Support for new resource types
 Native Services - Easy to define and run any custom service
 YARN Assembly - Easy to wire different apps together
GPU support on YARN
 Effort led by Wangda Tan from Hortonworks
 Why?
 GPU: many cores that handle massive numbers of simple computation tasks simultaneously
[Figure: GPU vs. CPU; computation-intensive work goes to the GPU, other work stays on the CPU]
 Without GPU support, researchers/engineers have to wait longer for their apps to finish.
Challenges of GPU support
 Different levels of support
– Take me to a machine where GPUs are available with Partitions / Node Labels. (Current status)
– Take me to a machine where GPUs are available
• give me a full device, exclusively, for the lifetime of my container
• give me multiple full devices, exclusively, for the lifetime of my container
• give me full device(s), exclusively, for a portion of the lifetime of my container
• give me a slice of device(s) for all or a portion of the lifetime of my container
 More dimensions:
– Bandwidths and on-GPU memory
– Topology of multiple GPUs
Slide credit to: Vinod Kumar Vavilapalli
Resource profiles and custom resource types
 Past
– Supports only Memory and CPU
 Now
– A generalized vector
– Admins can create custom Resource Types!
– Simpler resource requests using profiles
[Figure: two NodeManagers, each with Memory, CPU, and GPU; a ResourceManager holding the Small, Medium, and Large profiles; an Application Master requesting the Small profile]
Profile   Memory   CPU        GPU
Small     2 GB     4 Cores    1 Core
Medium    4 GB     8 Cores    1 Core
Large     16 GB    16 Cores   4 Cores
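The profile idea can be sketched as a lookup from a profile name to a resource vector. The values below are taken from the table on this slide; the dictionary keys, function names, and the override mechanism are hypothetical and do not reflect YARN's actual configuration format.

```python
# Profile values copied from the slide; key names are illustrative only.
PROFILES = {
    "small": {"memory_gb": 2, "cpu_cores": 4, "gpu": 1},
    "medium": {"memory_gb": 4, "cpu_cores": 8, "gpu": 1},
    "large": {"memory_gb": 16, "cpu_cores": 16, "gpu": 4},
}


def resolve_request(profile, overrides=None):
    """Expand a profile name into a resource vector, applying overrides."""
    if profile not in PROFILES:
        raise KeyError(f"unknown profile: {profile}")
    resources = dict(PROFILES[profile])
    # A custom resource type is just one more key in the vector.
    for name, amount in (overrides or {}).items():
        resources[name] = amount
    return resources


def fits(request, node_free):
    """True if every requested resource is available on the node."""
    return all(node_free.get(name, 0) >= amount
               for name, amount in request.items())


request = resolve_request("small", overrides={"gpu": 2})
print(request)
print(fits(request, {"memory_gb": 8, "cpu_cores": 8, "gpu": 2}))
```

Treating the request as a generalized vector means a new resource type needs no change to the matching logic, only a new key.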
YARN Native Services
• Long Running – Simplify the deployment and management of long-running applications on YARN.
• Easily Add New Applications – Remove the tedious process of bringing new applications to YARN.
• Easy to Use – REST API and command-line tools.
• Declarative Configuration – Provide configuration to the applications, declare resource needs, specify placement policies.
• Service Discovery – Simplified service discovery on YARN.
YARN Service assembly
• A collection of services/apps logically wired together for a common purpose
• Easily deploy services (e.g. ZooKeeper + HBase) as a combination through the REST API.
• User-defined service dependencies and readiness
• Container locality - affinity and anti-affinity scheduling
YARN assembly: Makes everything easier!
 Forget about writing an ApplicationMaster; this is how you can run an app on YARN:
 Write the assembly spec in JSON (we call it a Yarnfile)
 Post the JSON as a REST request to the YARN server.
 YARN figures out the rest.
 An example:
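The original example on this slide was an image and is not recoverable from the text. As a stand-in, here is a hedged sketch of what building such a JSON spec might look like. The field names (`components`, `number_of_containers`, `launch_command`, `resource`) are modeled on the YARN Services REST API as released later in Hadoop 3.x and should be treated as assumptions; the helper name, component name, and values are hypothetical.

```python
import json


def make_yarnfile(name, command, instances=1, cpus=1, memory_mb=256):
    """Build a Yarnfile-style service spec as a plain dict (assumed schema)."""
    return {
        "name": name,
        "version": "1.0",
        "components": [
            {
                "name": "worker",  # hypothetical component name
                "number_of_containers": instances,
                "launch_command": command,
                "resource": {"cpus": cpus, "memory": str(memory_mb)},
            }
        ],
    }


spec = make_yarnfile("sleeper-service", "sleep 900000", instances=2)
# This JSON body is what would be POSTed to the YARN REST endpoint.
payload = json.dumps(spec, indent=2)
print(payload)
```

The sketch stops at serialization on purpose: the endpoint URL and authentication depend on the cluster and the Hadoop version in use.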
YARN assembly: Run a multi-stage job
YARN assembly: Run Distributed TensorFlow Training (with PS)
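In a parameter-server (PS) setup, every process needs to know the full cluster layout plus its own role. A minimal sketch, assuming the `TF_CONFIG`-style JSON layout (`cluster` and `task` keys) that TensorFlow reads from the environment; the host names and the helper function are hypothetical, and how the assembly actually wires this up is not shown in the source.

```python
import json


def build_tf_configs(ps_hosts, worker_hosts):
    """Return one TF_CONFIG-style JSON string per task in the cluster."""
    cluster = {"ps": list(ps_hosts), "worker": list(worker_hosts)}
    configs = []
    for task_type, hosts in cluster.items():
        for index in range(len(hosts)):
            # Each task sees the same cluster but a different identity.
            configs.append(json.dumps({
                "cluster": cluster,
                "task": {"type": task_type, "index": index},
            }))
    return configs


configs = build_tf_configs(
    ps_hosts=["ps-0.example:2222"],
    worker_hosts=["worker-0.example:2222", "worker-1.example:2222"],
)
for config in configs:
    print(config)
```

Each string would be exported as an environment variable in the container that runs the matching task.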
YARN Assembly: Parallel Parameter Tuning
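Parallel tuning amounts to expanding a parameter grid into independent trials that can each run in its own container. The sketch below only generates the trial commands; the script name `train.py`, the flag names, and the grid values are hypothetical, and submitting the trials to YARN is left out.

```python
from itertools import product


def expand_grid(grid):
    """Turn {'param': [values...]} into a list of per-trial dicts."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


def trial_command(script, params):
    """Render one trial as a command line with --name=value flags."""
    flags = " ".join(f"--{name}={value}"
                     for name, value in sorted(params.items()))
    return f"python {script} {flags}"


grid = {"learning_rate": [0.01, 0.1], "batch_size": [64, 128]}
trials = expand_grid(grid)
commands = [trial_command("train.py", trial) for trial in trials]
for command in commands:
    print(command)
```

Since the trials share no state, they can be scheduled side by side and compared once all of them finish.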
YARN Assembly: Model Serving and Update
 Application & service upgrades
– “Do an upgrade of my TensorFlow serving model with minimal impact to end-users”
– Use serving.tensorflow-mode-serving.wtan.domain:1234 to access the service.
– YARN could do load balancing for launched instances.
Example walkthrough: How to do Click-Through-Rate prediction on a big data platform
Click-Through Rate (CTR) Prediction
 Given a user and context, predict probability of a click for an ad.
 Probably the most “profitable” machine learning problem in industry
 Basic setting is quite well studied; scale makes it challenging
– Google, Facebook, Yahoo, Bing
 Challenges
– Simple binary problem; but want probabilities, not just the label
– Very skewed label distribution: clicks << skips
– Tons of data (every impression generates a training example)
– Limitations at serving: need to predict quickly
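Why probabilities matter under a skewed label distribution can be shown with log loss. In this sketch the data is made up (1 click in 100 impressions): a model that always outputs "no click" is 99% accurate, yet it scores much worse than one that simply outputs the base click rate.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under `probs`."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


labels = [1] + [0] * 99  # made-up data: 1% click rate
base_rate = sum(labels) / len(labels)

always_zero = log_loss(labels, [0.0] * len(labels))
calibrated = log_loss(labels, [base_rate] * len(labels))
print(always_zero, calibrated)
```

Accuracy cannot tell these two predictors apart in a useful way, which is one reason CTR systems are usually evaluated on probability-aware metrics.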
Labeled events
Impression       Label
Impression0      click
Impression1      non-click
Impression2      non-click
…                …
Impression10     non-click
Impression11     click
Impression12     non-click
…                …
Impression100    non-click
Impression101    non-click
Impression102    non-click
…                …
CTR model
 Logistic regression (LR)
– LR with SGD/L-BFGS (batch)
– Follow-the-regularized-leader (FTRL) (online)
 Factorization Machines (FM/FFM)
 Gradient boosting tree (GBT)
 Deep neural networks (DNN)
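The simplest entry in this list, logistic regression trained with SGD, fits in a short sketch. This is a plain-Python illustration on made-up data with sparse binary features (each example lists its active feature indices); it is not the demo code from the talk, and the learning rate, epoch count, and feature meanings are assumptions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, features):
    """Click probability for an example given its active feature indices."""
    return sigmoid(bias + sum(weights[i] for i in features))


def train_lr_sgd(examples, num_features, epochs=20, lr=0.1, seed=0):
    """Fit logistic regression by SGD on (active_indices, label) pairs."""
    weights = [0.0] * num_features
    bias = 0.0
    rng = random.Random(seed)
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for features, label in data:
            # Gradient of log loss w.r.t. the logit is (p - y).
            grad = predict(weights, bias, features) - label
            bias -= lr * grad
            for i in features:
                weights[i] -= lr * grad
    return weights, bias


# Made-up data: feature 0 = "ad A" (8 of 10 clicked), feature 1 = "ad B" (1 of 10).
examples = ([([0], 1)] * 8 + [([0], 0)] * 2 +
            [([1], 1)] * 1 + [([1], 0)] * 9)
weights, bias = train_lr_sgd(examples, num_features=2)
p_a = predict(weights, bias, [0])
p_b = predict(weights, bias, [1])
print(p_a, p_b)
```

The output is a probability rather than a hard label, which is what the CTR setting asks for; FTRL, FM, GBT, and DNN models replace the update rule or the model form but keep this interface.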
Questions?
-- Related Session --
BoF’s: Apache Hadoop – YARN, HDFS
Sanjay Radia, Hortonworks
Sunil G, Hortonworks
Rohith Sharma K S, Hortonworks
https://dataworkssummit.com/sydney-2017/birds-of-a-feather/apache-hadoop-yarn-hdfs
Thursday September 21st, Room C2.1, 6.00 PM
Thank you!

Tensorflow on Apache Hadoop YARN

  • 1. TensorFlow on Apache Hadoop YARN
       Sunil Govindan, Apache Hadoop PMC member, Hortonworks
  • 2. Agenda
       • Machine Learning on Big Data Platform
       • TensorFlow on Apache Hadoop YARN
       • Example walkthrough with demo
       © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 3. Related session: Apache YARN - Past, Present and Future
       Sunil G (Hortonworks), Rohith Sharma K S (Hortonworks)
       Wednesday September 20th, Room C4.5, 5.10 PM
       https://dataworkssummit.com/sydney-2017/sessions/yarn-past-present-future/
  • 4. Machine Learning on Big Data Platform
  • 5. Machine Learning on Big Data Platform
       [Diagram label: Apache Zeppelin]
  • 6. Machine Learning on Big Data Platform
       • Data scientists: explore data, create pipeline, find best params, save model
       • Software engineers: load model, deploy in production, scoring on batch/streaming data
       [Diagram label: Apache Zeppelin]
  • 7. Machine learning workflow
       [Diagram, by stage:
        Data Preprocessing / Feature Engineering: Data, Feature Selection, Feature Transform, Feature Encoding, Feature Evaluation
        Model Training: Feature, Model Training, Model Evaluation, Model Validation, Model Staging
        Online Service: Experiment, Online Feature, Model Database, Model as Service, Real-time Feature, Calibration]
  • 8. Machine Learning in a Unified Platform
       [Figure from “Hidden Technical Debt in Machine Learning Systems”, Google]
  • 9. Machine learning – Data Preprocessing
       [Diagram boxes: Data, Feature Selection, Feature Transform, Feature Encoding, Feature Evaluation; stage label: Feature Engineering]
       • Import data
           – HDFS
           – AWS
           – RDBMS
       • Join data
       • Data exploration
       • Data sample
       • Training/Test random split
  • 10. Machine learning – Feature Engineering
       [Diagram boxes: Data, Feature Selection, Feature Transform, Feature Encoding, Feature Evaluation; stage label: Feature Engineering]
       • Feature transform/selection
       • Feature embedding
  • 11. Machine learning – Model Training
       [Diagram boxes: Feature, Model Training, Model Evaluation, Model Validation, Model Staging; stage label: Model Training]
       • Traditional machine learning models
           – Logistic Regression
           – Gradient boosting tree
           – Recommendation/ALS
           – LDA
       • Libraries
           – Apache Spark MLlib
           – XGBoost
       • Deep learning models
           – DNN
           – CNN
           – RNN
           – LSTM
       • Libraries
           – TensorFlow
           – Apache MXNet
           – BigDL
  • 12. Model Training - Deep learning can’t fit all
       • Natural language processing
       • Computer vision
       • Speech/Video
       • Anti-fraud
       • Recommendation
       • CTR estimation
       • Topic model
       • PageRank
  • 13. Machine learning – Model Serving
       [Diagram boxes: Experiment, Online Feature, Model Database, Model as Service, Real-time Feature, Calibration; stage label: Online Service]
       • Model deploy
       • Model serving
           – Batch
           – Streaming
       • Experiment
           – offline
           – online (A/B test)
  • 14. TensorFlow on Apache Hadoop YARN
  • 15. Apache Hadoop YARN Overview
       • YARN is Apache Hadoop’s resource management framework
       • YARN is made up of the following components:
           – ResourceManager (RM): the master service for YARN; allocates resources to applications
           – Per-node NodeManager (NM): launches application containers and monitors their resource usage
           – Per-application ApplicationMaster (AM): requests containers for the application
       • At its core, YARN is responsible for managing “containers” across a collection of servers
  • 16. Machine learning platform on YARN
       [Diagram:
        Applications: Spark MLlib, XGBoost, Hive/LLAP, Spark SQL, TensorFlow, Zeppelin
        YARN: Data Operating System (Cluster Resource Management)
        Resources: CPU, GPU, SSD
        Data sources: HDFS, AWS S3, RDBMS]
  • 17. Why all under YARN
       • CPU / Memory, (WIP) GPU, FPGA, Network
       • Queues / Users quota, user access control
       • Timeline services, Grafana, etc.
       • Capacity Planning, Preemption, Reservation System
       [Diagram labels: Better SLA, Monitoring, Isolation, Governance]
  • 18. All running on the same YARN platform
       [Diagram: cluster nodes labelled 128 G, some running LLAP, alongside GPUs]
  • 19. Recent work in YARN to support ML workloads like TensorFlow
       • GPU Support – Boost workload performance with GPUs
       • Resource Profiles and support for Custom Resource Types – Support for new resources
       • Native Service – Easy to define and run any custom service
       • YARN Assembly – Easy to wire different apps together
  • 20. GPU support on YARN
       • Effort led by Wangda Tan from Hortonworks
       • Why?
       • GPU: many cores to handle massive (but simple) computation tasks simultaneously
       [Diagram labels: GPU, CPU, Computation Intensive, Other]
       Without GPU support, researchers and engineers have to wait longer for their apps to finish.
  • 21. Challenges of GPU support
       • Different levels of support
           – Take me to a machine where GPUs are available, with Partitions / Node Labels. (Current status)
           – Take me to a machine where GPUs are available
               • give me a full device only to me for the lifetime of my container
               • give me multiple full devices only to me for the lifetime of my container
               • give me full device(s) only to me for a portion of the lifetime of my container
               • give me a slice of device(s) to me for a full / portion of the lifetime of my container
       • More dimensions:
           – Bandwidths and on-GPU memory
           – Topology of multiple GPUs
       Slide credit to: Vinod Kumar Vavilapalli
  • 22. Resource profiles and custom resource types
       • Past
           – Supports only Memory and CPU
       • Now
           – A generalized vector
           – Admins can create custom Resource Types!
           – Easier resource requests using profiles
       [Diagram: two NodeManagers (each with Memory, CPU, GPU), a ResourceManager holding the Small / Medium / Large profiles, and an Application Master requesting “Small”]

       Profile   Memory   CPU        GPU
       Small     2 GB     4 Cores    1 Core
       Medium    4 GB     8 Cores    1 Core
       Large     16 GB    16 Cores   4 Cores
  • 23. YARN Native Services
       • Long Running – Simplify the deployment and management of long-running applications on YARN.
       • Easily Add New Applications – Remove the tedious process of bringing new applications to YARN.
       • Easy to Use – REST API and command line tools.
       • Declarative Configuration – Provide configuration to the applications, declare resource needs, specify placement policies.
       • Service Discovery – Simplified service discovery on YARN.
  • 24. YARN Service assembly
       • Collection of services/apps logically wired together for a common purpose
       • Easily deploy services (ZooKeeper + HBase) as a combination through the REST API
       • User-defined service dependencies and readiness
       • Container localities: affinity and anti-affinity scheduling
  • 25. YARN assembly: Makes everything easier!
       • Forget about writing an application master; this is how you can run an app on YARN:
       • Write the assembly spec in JSON (we call it a Yarnfile)
       • Post the JSON as a REST request to the YARN server
       • YARN figures out the rest
       • An example: [shown as an image on the slide]
  • 26. YARN assembly: Run multi-stage jobs
  • 27. YARN assembly: Run Distributed TensorFlow Training (with PS)
  • 28. YARN Assembly: Parallel Parameter Tuning
  • 29. YARN Assembly: Model Serving and Update
       • Application & Services upgrades
           – “Do an upgrade of my TensorFlow serving model with minimal impact to end-users”
           – Use serving.tensorflow-mode-serving.wtan.domain:1234 to access the service.
           – YARN could do load balancing for launched instances.
  • 30. Example walkthrough: How to do Click-Through-Rate prediction on a big data platform
  • 31. Click-Through Rate (CTR) Prediction
       • Given a user and context, predict the probability of a click for an ad.
       • Probably the most “profitable” machine learning problem in industry
       • Basic setting quite well-studied; scale makes it challenging
           – Google, Facebook, Yahoo, Bing
       • Challenges
           – Simple binary problem, but we want probabilities, not just the label
           – Very skewed label distribution: clicks << skips
           – Tons of data (every impression generates a training example)
           – Limitations at serving: need to predict quickly
  • 32. Labeled events (three batches shown on the slide)
       Impression0     click
       Impression1     non-click
       Impression2     non-click
       …
       Impression10    non-click
       Impression11    click
       Impression12    non-click
       …
       Impression100   non-click
       Impression101   non-click
       Impression102   non-click
       …
  • 33. CTR model
       • Logistic regression (LR)
           – LR on SGD/LBFGS: batch
           – Follow the regularized leader (FTRL): online
       • Factorization Machines (FM/FFM)
       • Gradient boosting tree (GBT)
       • Deep neural networks (DNN)
  • 34. Questions?
  • 35. Related session: BoF’s: Apache Hadoop – YARN, HDFS
       Sanjay Radia (Hortonworks), Sunil G (Hortonworks), Rohith Sharma K S (Hortonworks)
       Thursday September 21st, Room C2.1, 6.00 PM
       https://dataworkssummit.com/sydney-2017/birds-of-a-feather/apache-hadoop-yarn-hdfs
  • 36. Thank you!
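
The profile table on slide 22 can be read as a mapping from a profile name to a resource vector. The sketch below is only an illustration of that idea: `PROFILES`, `resolve`, `fits`, and `allocate` are hypothetical names, not YARN APIs, and only the profile values come from the slide.

```python
# Hypothetical model of the slide-22 profile table; this is not YARN's API.
PROFILES = {
    "small":  {"memory_gb": 2,  "cpu_cores": 4,  "gpu": 1},
    "medium": {"memory_gb": 4,  "cpu_cores": 8,  "gpu": 1},
    "large":  {"memory_gb": 16, "cpu_cores": 16, "gpu": 4},
}


def resolve(profile_name):
    """Return a copy of the resource vector for a named profile."""
    return dict(PROFILES[profile_name.lower()])


def fits(request, available):
    """True if every requested resource type is available in sufficient quantity."""
    return all(available.get(name, 0) >= amount for name, amount in request.items())


def allocate(request, available):
    """Return the node's remaining resources after granting the request."""
    if not fits(request, available):
        raise ValueError("request does not fit on this node")
    return {name: available[name] - request.get(name, 0) for name in available}
```

With a node offering 16 GB, 16 cores, and 2 GPUs, `fits(resolve("Small"), node)` holds while `fits(resolve("Large"), node)` does not, because the large profile asks for 4 GPUs. Treating resources as a generic name-to-amount vector is what lets a new type such as GPU be added without changing the matching logic.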
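
Slides 25 to 27 describe writing a JSON spec and posting it to YARN, but the example itself was an image and is not in the transcript. The sketch below builds such a spec for a training job with parameter servers and workers. It rests on several assumptions: the field names (`name`, `version`, `components`, `number_of_containers`, `launch_command`, `resource`) follow the YARN Service API as later released in Hadoop 3.x and may differ from the Yarnfile format shown in the talk, and the service name, launch commands, and sizes are invented for illustration.

```python
import json


def component(name, count, command, cpus, memory_mb):
    """Describe one group of identical containers (field names are assumed)."""
    return {
        "name": name,
        "number_of_containers": count,
        "launch_command": command,
        "resource": {"cpus": cpus, "memory": str(memory_mb)},
    }


def training_service(num_ps, num_workers):
    """Build a spec with one parameter-server component and one worker component."""
    return {
        "name": "tf-training-demo",  # hypothetical service name
        "version": "1.0",
        "components": [
            # The launch commands below are placeholders, not a real training script.
            component("ps", num_ps, "python train.py --job_name=ps", 2, 4096),
            component("worker", num_workers, "python train.py --job_name=worker", 4, 8192),
        ],
    }


spec_json = json.dumps(training_service(num_ps=1, num_workers=3), indent=2)
```

The resulting `spec_json` would be sent as the body of a REST request to the YARN services endpoint, whose path depends on the Hadoop version in use.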
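
Slide 33 lists FTRL as the online variant of logistic regression for CTR. The slides do not say which formulation was used, so the sketch below assumes the per-coordinate FTRL-Proximal update commonly used for sparse binary features. The class name, the default hyperparameters, and the feature strings in the usage note are all illustrative.

```python
import math
from collections import defaultdict


class FTRLProximal:
    """Online logistic regression over sets of active binary features."""

    def __init__(self, alpha=0.1, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # per-feature adjusted gradient sums
        self.n = defaultdict(float)  # per-feature sums of squared gradients

    def _weight(self, i):
        """Derive the weight lazily from z and n; L1 keeps small weights at zero."""
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        denom = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def predict(self, features):
        """Return the predicted click probability for a set of active features."""
        score = sum(self._weight(i) for i in features)
        score = max(min(score, 35.0), -35.0)  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features, label):
        """Take one online step on a single example; features must be unique."""
        p = self.predict(features)
        g = p - label  # log-loss gradient for each active binary feature
        for i in features:
            w = self._weight(i)
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w
            self.n[i] += g * g
        return p
```

Each impression is passed once through `update({"ad:123", "site:news"}, label)`, which suits the streaming setting where every impression generates a training example. The model outputs a probability rather than a hard label, matching the requirement on slide 31.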

Editor's Notes

  1. Data is flooding into every business. In many applications, more training data and bigger models mean better results. We use Hadoop to store large amounts of data and Spark on YARN for simple data processing, and we can also run machine learning frameworks such as TensorFlow or XGBoost on the Hadoop-based big data platform for machine learning or deep learning.
  2. Another important change is in the roles involved in machine learning. As datasets grow and problems become more complex, one person can’t do all of the work; data scientists need to work together with software engineers. Data scientists usually explore the data and find the best machine learning pipeline. After that, software engineers deploy the model and make predictions on new input, which can be batch or streaming data.
  3. This is a typical machine learning workflow, which involves three steps: feature engineering, model training, and online service. Not surprisingly, the most important thing is to have the right features: those capturing historical information dominate other types of features. Once we have the right features and the right model, other factors play small roles. We first derive a feature representation from the raw data, then feed these features into a machine learning model, and then evaluate the models and choose the best one to push into online service. The workflow is complicated and usually involves several steps supported by several infrastructure components.
  4. As the workflow shows, only a tiny fraction of the code is actually devoted to model learning. The machine learning workflow usually needs a lot of support from the big data platform, such as data collection from different data sources, feature extraction, feature transformation, and so on. Let’s find out, step by step, how big data infrastructure can help machine learning.
  5. The machine learning workflow starts with loading data from different data sources, such as HDFS, AWS S3, or a database system. After that, we usually join data from the different sources to generate a wide table; Apache Hive and Apache Spark are the most appropriate tools for this workload. Data scientists then start data exploration via Zeppelin. The most common issue is an unbalanced label distribution, for example when positive labels far outnumber negative labels. To get a more accurate model, we subsample the group that has more instances so that the dataset is balanced. After that, we randomly split the dataset into training and test sets with the help of Spark. Once we have training data, we can start feature engineering.
  6. Feature engineering has made great progress over the past decade, from hand-designed features to automated feature discovery by deep learning. In many cases, hand-designed features leverage domain knowledge and lead to optimal results, and Spark MLlib provides many feature transform/selection operators to make this simple. However, this approach involves heavy manual work and requires hiring experienced engineers. In recent years, more and more scientists and engineers have applied deep neural networks (DNNs) to computer vision, speech recognition, and natural language processing, with good results. A DNN can learn features automatically via embedding. The best-known embedding technique is word2vec, which produces a vector space in which each unique word in the corpus is assigned a corresponding vector.
  7. Model training is the most important step of the whole pipeline.
  8. Deep learning is becoming more and more powerful, but it can’t solve all of humanity’s problems. In natural language processing, computer vision, and speech or video recognition, deep learning may perform better than traditional models. But for problems like recommendation or CTR estimation, very scalable linear models still play a major role. And for graph-related models such as topic models or PageRank, we still need a graph computation engine. Furthermore, hybrid models are becoming more and more useful. For example, Facebook presents a hybrid model structure: the concatenation of boosted decision trees and a probabilistic sparse linear classifier, illustrated in the figure. Their experience shows that this hybrid structure significantly increases prediction accuracy. Google also developed a hybrid model, the wide and deep learning model, which jointly trains a wide linear model (for memorization) alongside a deep neural network (for generalization), combining the strengths of both. From these cases, we can learn that a machine learning platform should support both traditional machine learning models and deep learning models, since both are very useful.
  9. Deploy the model in a distributed fashion for parallel model serving in batch or streaming mode. Evaluate the model offline or online using different metrics.
  10. Predicting Ad CTR is a massive-scale machine learning problem that is central to the multi-billion dollar online advertising industry. A typical CTR prediction problem shares some similarities with many other industry machine learning problems, which makes it very representative.
  11. Usually there are billions of ad impressions daily. Each impression has a unique ID. We need to join impressions with the click stream every x minutes to build the dataset for machine learning.
  12. Each model has advantages and disadvantages. Non-linear models are able to utilize different feature combinations and thus could potentially improve estimation performance, but they can’t scale to a large number of parameters. Deep neural networks (DNNs) are able to extract hidden structures and intrinsic patterns at different levels of abstraction from the training data. But training deep neural networks on a large input feature space requires tuning a huge number of parameters, which is computationally expensive. In addition, the raw input features are high-dimensional, sparse binary features converted from raw categorical features, which makes it hard to train traditional DNNs at large scale.
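
Note 5 describes subsampling the larger label group and then splitting the data randomly into training and test sets. The notes do this with Spark; the sketch below shows the same two steps on an in-memory list in plain Python, purely as an illustration. The function names, the `label` key, and the fixed seeds are assumptions.

```python
import random


def balance_by_downsampling(rows, label_key="label", seed=0):
    """Subsample every label group down to the size of the smallest group."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and return (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

Balancing before the split keeps both sets at roughly the same label ratio. A model trained on subsampled data predicts probabilities for the balanced distribution, which is one reason the workflow on slide 7 includes a calibration step.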
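
Note 11 describes joining impressions with the click stream to produce labeled events like those on slide 32. A minimal sketch of that join for one batch window follows. The record layout and the `impression_id` field are hypothetical, and a production pipeline would run this as a distributed join rather than in memory.

```python
def label_impressions(impressions, clicks):
    """Left-join impressions with clicks: label 1 if clicked, else 0."""
    clicked_ids = {click["impression_id"] for click in clicks}
    return [
        {**imp, "label": 1 if imp["impression_id"] in clicked_ids else 0}
        for imp in impressions
    ]
```

Every impression is kept, including those without a click, so each one yields a training example and the skewed label distribution from slide 31 appears directly in the output.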