Hadoop Ecosystem Boosts TensorFlow
and Machine Learning Technologies
Yanbo Liang
Apache Spark Committer
Hortonworks
Wangda Tan
Apache Hadoop PMC member
Hortonworks
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
 Overview of Machine Learning on Big Data Platform
 How Apache Hadoop YARN boosts machine learning workloads
 Example walkthrough: How to do Click-Through Rate prediction on a big data platform
Overview of Machine Learning on Big
Data Platform
Machine Learning on Big Data Platform
Apache Zeppelin
Machine Learning on Big Data Platform
Data Scientists: Explore data, Create pipeline, Find best params, Save model
Software engineers: Load model, Deploy in production, Scoring on batch/streaming data
Apache Zeppelin
Machine learning workflow
[Diagram: the workflow runs through four stages: Data Preprocessing, Feature Engineering, Model Training, Online Service. Box labels: Data; Feature Selection, Feature Transform, Feature Encoding, Feature Evaluation; Model Training, Model Evaluation, Model Validation, Model Staging; Experiment, Online Feature, Model Database, Model as Service, Real-time Feature, Calibration.]
Machine Learning in a Unified Platform
“Hidden Technical Debt in Machine Learning Systems”,
Google
Machine learning – Data Preprocessing
[Diagram: Data, Feature Selection, Feature Transform, Feature Encoding, Feature Evaluation (Feature Engineering stage)]
 Import data
– HDFS
– AWS
– RDBMS
 Join data
 Data exploration
 Data sample
 Training/Test random split
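The last preprocessing step, the random training/test split, can be sketched in plain Python. This is only an illustration under assumptions: the function name, the 80/20 ratio and the seed are hypothetical, and on a Spark DataFrame the same step would typically be done with `randomSplit` rather than in driver memory.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Randomly split rows into (train, test) without modifying the input."""
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A dedicated, seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical sample: 100 impression ids.
rows = list(range(100))
train, test = train_test_split(rows, test_fraction=0.2, seed=7)
```

Fixing the seed matters in practice: it lets different models be compared on exactly the same held-out rows.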
Machine learning – Feature Engineering
[Diagram: Data, Feature Selection, Feature Transform, Feature Encoding, Feature Evaluation (Feature Engineering stage)]
 Feature transform/selection
 Feature embedding
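As one concrete instance of a feature transform, the sketch below one-hot encodes a categorical column, which corresponds to the Feature Encoding box in the workflow. The slides do not say which encoders were used, so the function names and the sample values are assumptions for illustration only.

```python
def fit_vocabulary(values):
    """Assign each distinct category a column index, in first-seen order."""
    vocab = {}
    for value in values:
        vocab.setdefault(value, len(vocab))
    return vocab


def one_hot(value, vocab):
    """Encode one category as a 0/1 vector; unseen categories become all zeros."""
    vector = [0] * len(vocab)
    index = vocab.get(value)
    if index is not None:
        vector[index] = 1
    return vector


# Hypothetical categorical column, e.g. the device type of an ad impression.
devices = ["mobile", "desktop", "mobile", "tablet"]
vocab = fit_vocabulary(devices)
encoded = [one_hot(d, vocab) for d in devices]
```

Mapping unseen categories to an all-zero vector is one of several reasonable choices; a dedicated "unknown" column is another.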
Machine learning – Model Training
[Diagram: Model Training, Model Evaluation, Model Validation, Model Staging (Model Training stage)]
 Traditional machine
learning models
– Logistic Regression
– Gradient boosting tree
– Recommendation/ALS
– LDA
 Libraries
– Apache Spark MLlib
– XGBoost
 Deep learning models
– DNN
– CNN
– RNN
– LSTM
 Libraries
– TensorFlow
– Apache MXNet
– BigDL
Model Training - Deep learning doesn’t fit every problem
 Natural language processing
 Computer vision
 Speech/Video
 Anti-fraud
 Recommendation
 CTR estimation
 Topic model
 PageRank
Machine learning – Model Serving
[Diagram: Experiment, Online Feature, Model Database, Model as Service, Real-time Feature, Calibration (Online Service stage)]
 Model deploy
 Model serving
– Batch
– Streaming
 Experiment
– offline
– online (A/B test)
How Apache Hadoop YARN boosts
machine learning workloads
Machine learning platform
Hadoop YARN
HDFS AWS S3 RDBMS
Spark MLlib XGBoost TensorFlow
Zeppelin
Hive/LLAP Spark SQL
CPU GPU SSD
Why all under YARN
What a normal YARN user gets:
– SLA: capacity planning, preemption, reservation system.
– Monitoring: timeline services, Grafana, etc.
– Quotas: queue / user quotas, user access control.
– Isolation: CPU / memory, (WIP) GPU, FPGA, network.
All running on the same YARN platform
[Diagram: a row of 128 G nodes, several running LLAP and some with GPUs]
GPU support on YARN
 Why?
 GPU: many cores that handle massive (but simple) computation tasks simultaneously.
[Diagram: GPU and CPU; labels: Computation Intensive, Other]
Without GPU support, jobs run so long that researchers and engineers can hardly wait for them to finish.
Challenges of GPU support
 Different levels of support
– Take me to a machine where GPUs are available with Partitions / Node Labels. (Current status)
– Take me to a machine where GPUs are available
• give me a full device only to me for the lifetime of my container
• give me multiple full devices only to me for the lifetime of my container
• give me full device(s) only to me for a portion of the lifetime of my container
• give me a slice of device(s) to me for a full / portion of the lifetime of my container
 More dimensions:
– Bandwidths and on-GPU memory
– Topology of multiple GPUs
Slide credit to: Vinod Kumar Vavilapalli
YARN assembly: Makes everything easier!
 Forget about writing an application master. This is how you run an app on YARN:
 Write the assembly spec in JSON (we call it a Yarnfile).
 Post the JSON as a REST request to the YARN server.
 YARN figures out the rest.
 An example:
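The example shown on the slide is not part of this text. Purely as a hypothetical illustration of the "JSON spec plus REST request" flow, the sketch below builds a spec and prepares (but does not send) a POST request. Every field name, the component name and the URL are assumptions made for this sketch; they are not the actual Yarnfile schema or a real YARN endpoint.

```python
import json
from urllib.request import Request


def build_spec(name, command, instances, memory_mb, vcores):
    """Describe an application declaratively. All field names are hypothetical."""
    return {
        "name": name,
        "components": [
            {
                "name": "worker",
                "number_of_containers": instances,
                "launch_command": command,
                "resource": {"memory_mb": memory_mb, "vcores": vcores},
            }
        ],
    }


def build_request(spec, url):
    """Prepare a POST request carrying the spec as JSON; nothing is sent here."""
    body = json.dumps(spec).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(url, data=body, headers=headers, method="POST")


spec = build_spec("sleeper", "sleep 3600", instances=2, memory_mb=1024, vcores=1)
# Placeholder URL, not a real YARN REST path.
request = build_request(spec, "http://yarn.example.invalid/services")
```

The point of the design is that the user only states what should run and with which resources; placement, launch and restarts are left to the platform.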
YARN assembly: Run a multi-stage job
YARN assembly: Run Distributed TensorFlow Training (with parameter servers, PS)
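In a parameter-server setup, every task needs to know the full cluster layout and its own role. One common way to pass this to TensorFlow is the `TF_CONFIG` environment variable, a JSON document with a `cluster` and a `task` entry. Whether the setup in this talk used `TF_CONFIG` is not stated, so the sketch below is an assumption; the host names and the helper names are hypothetical.

```python
import json


def tf_config(ps_hosts, worker_hosts, task_type, task_index):
    """Return the TF_CONFIG JSON string for one task of a PS/worker cluster."""
    cluster = {"ps": list(ps_hosts), "worker": list(worker_hosts)}
    if task_type not in cluster:
        raise ValueError("task_type must be 'ps' or 'worker'")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task_index out of range for this task type")
    config = {"cluster": cluster, "task": {"type": task_type, "index": task_index}}
    return json.dumps(config)


def all_task_envs(ps_hosts, worker_hosts):
    """Build the environment for every container that would be launched."""
    envs = {}
    for task_type, hosts in (("ps", ps_hosts), ("worker", worker_hosts)):
        for index in range(len(hosts)):
            config = tf_config(ps_hosts, worker_hosts, task_type, index)
            envs[f"{task_type}-{index}"] = {"TF_CONFIG": config}
    return envs


# Hypothetical host:port pairs; a scheduler would fill these in after allocation.
envs = all_task_envs(["ps0:2222"], ["w0:2222", "w1:2222"])
```

Each launched container gets the same cluster description and a different task entry, which is why this bookkeeping fits naturally into the layer that allocates the containers.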
YARN Assembly: Parallel Parameter Tuning
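Parallel parameter tuning amounts to expanding a grid of hyperparameters into independent trials and running them side by side. The sketch below shows the idea on one machine: a thread pool stands in for separate YARN containers, and the objective is a toy function rather than a real training run. The parameter names and values are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def expand_grid(grid):
    """Turn {'a': [1, 2], 'b': [3]} into a list of one dict per combination."""
    keys = sorted(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


def toy_validation_loss(params):
    """Stand-in objective; a real trial would train a model and evaluate it."""
    return (params["learning_rate"] - 0.1) ** 2 + (params["reg"] - 0.01) ** 2


def tune(grid, objective, max_parallel=4):
    """Run all trials concurrently and return (best_params, best_loss)."""
    trials = expand_grid(grid)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        losses = list(pool.map(objective, trials))
    best = min(range(len(trials)), key=losses.__getitem__)
    return trials[best], losses[best]


grid = {"learning_rate": [0.01, 0.1, 1.0], "reg": [0.0, 0.01, 0.1]}
best_params, best_loss = tune(grid, toy_validation_loss)
```

Because trials do not communicate, they can be scheduled wherever resources are free, which is what makes this workload a good fit for a shared cluster.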
YARN Assembly: Model Serving and Update
 Application & Services upgrades
– “Do an upgrade of my TensorFlow serving model with minimal impact to end-users”
- Use serving.tensorflow-mode-serving.wtan.domain:1234 to access the service.
- YARN could do load balancing for launched instances.
Example walkthrough: How to do Click-Through Rate prediction on a big data platform
Click-Through Rate (CTR) Prediction
 Given a user and context, predict the probability of a click on an ad.
 Probably the most “profitable” machine learning problem in industry
 Basic setting is quite well-studied; scale makes it challenging
– Google, Facebook, Yahoo, Bing
 Challenges
– Simple binary problem; but want probabilities, not just the label
– Very skewed label distribution: clicks << skips
– Tons of data (every impression generates a training example)
– Limitations at serving: need to predict quickly
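The first two challenges interact: with very few clicks, plain accuracy cannot tell a useful probability model from one that always answers "no click". The sketch below compares accuracy with log loss on a toy sample with one click in 100 impressions. The data and the two candidate predictors are made up for illustration.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def accuracy(labels, probs, threshold=0.5):
    """Fraction of examples where the thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(labels, probs))
    return hits / len(labels)


# Toy skewed sample: 1 click, 99 skips.
labels = [1] + [0] * 99
always_no = [0.0] * 100   # hard "no click" for every impression
base_rate = [0.01] * 100  # predicts the observed click rate everywhere
```

Both predictors reach the same accuracy, yet log loss clearly prefers the one that reports the base rate, which is why CTR models are usually judged on probability-based metrics.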
Labeled events
Impression0 click
Impression1 non-click
Impression2 non-click
… …
Impression10 non-click
Impression11 click
Impression12 non-click
… …
Impression100 non-click
Impression101 non-click
Impression102 non-click
… …
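A table like this is typically produced by matching an impression log against a click log. The slides do not describe how the labels were built, so the following is a minimal sketch under that assumption; the function names are hypothetical, and the sample rows reuse the impression names shown above.

```python
def label_impressions(impression_ids, clicked_ids):
    """Label each impression 1 if it appears in the click log, else 0."""
    clicked = set(clicked_ids)
    return [(imp, 1 if imp in clicked else 0) for imp in impression_ids]


def click_through_rate(labeled):
    """Share of impressions that were clicked; 0.0 for an empty sample."""
    if not labeled:
        return 0.0
    return sum(label for _, label in labeled) / len(labeled)


impressions = [f"Impression{i}" for i in (0, 1, 2, 10, 11, 12)]
labeled = label_impressions(impressions, ["Impression0", "Impression11"])
```

At scale the same logic becomes a left join of impressions with clicks, where missing matches are the negative examples.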
CTR model
 Logistic regression (LR)
– LR with SGD/L-BFGS - batch
– Follow the regularized leader (FTRL) - online
 Factorization Machines (FM/FFM)
 Gradient boosting tree (GBT)
 Deep neural networks (DNN)
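To make the online variant concrete, here is a minimal sketch of logistic regression trained with the per-coordinate FTRL-Proximal update, written for sparse binary features. It follows the published form of the algorithm, not any code from this talk; the hyperparameter values, feature names and toy events are assumptions.

```python
import math


class FTRLProximal:
    """Online logistic regression over sparse binary features."""

    def __init__(self, alpha=0.1, beta=1.0, l1=0.1, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-feature accumulated (adjusted) gradients
        self.n = {}  # per-feature accumulated squared gradients

    def _weight(self, key):
        """Derive the weight lazily; L1 keeps rarely useful features at zero."""
        z = self.z.get(key, 0.0)
        if abs(z) <= self.l1:
            return 0.0
        n = self.n.get(key, 0.0)
        sign = 1.0 if z > 0 else -1.0
        return -(z - sign * self.l1) / ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, features):
        """features: iterable of distinct active feature names."""
        logit = sum(self._weight(k) for k in features)
        logit = max(min(logit, 35.0), -35.0)  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-logit))

    def update(self, features, label):
        """One online step on a single (features, label) event."""
        weights = {k: self._weight(k) for k in features}
        g = self.predict(features) - label  # gradient w.r.t. each active feature
        for k, w in weights.items():
            n = self.n.get(k, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[k] = self.z.get(k, 0.0) + g - sigma * w
            self.n[k] = n + g * g


# Toy stream: ad "a" is always clicked, ad "b" never.
events = [(["site=news", "ad=a"], 1), (["site=news", "ad=b"], 0)] * 50
model = FTRLProximal()
for features, label in events:
    model.update(features, label)
p_a = model.predict(["site=news", "ad=a"])
p_b = model.predict(["site=news", "ad=b"])
```

The model stores only two numbers per feature and touches only the active features of each event, which is what makes this style of update attractive when every impression is a training example.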
Questions?

Hadoop ecosystem boosts Tensorflow and machine learning technologies

  • 1. Hadoop Ecosystem Boosts TensorFlow and Machine Learning Technologies Yanbo Liang Apache Spark Committer Hortonworks Wangda Tan Apache Hadoop PMC member Hortonworks
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Agenda  Overview of Machine Learning on Big Data Platform  How Apache Hadoop YARN boosts machine learning workloads  Example walkthrough: How to do Click-Through- Rate on a big data platform
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Overview of Machine Learning on Big Data Platform
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Machine Learning on Big Data Platform Apache Zeppelin
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Machine Learning on Big Data Platform Data Scientists Software engineers Explore data Create pipeline Find best params Save model Load model Deploy in production Scoring on batch/streaming data Apache Zeppelin
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Machine learning workflow Feature Selection Data Feature Transform Feature Encoding Feature Evaluation Model Training Feature Model Evaluation Model Validation Model Staging Experiment Online Feature Model Database Exper- iment Model as Service Real-time Feature Calibration Data Preprocessing Feature Engineering Model Training Online Service
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Machine Learning in a Unified Platform “Hidden Technical Debt in Machine Learning Systems”, Google
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Machine learning – Data Preprocessing Feature Selection Data Feature Transform Feature Encoding Feature Evaluation Feature Engineering  Import data – HDFS – AWS – RDBMS  Join data  Data exploration  Data sample  Training/Test random split
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Machine learning – Feature Engineering Feature Selection Data Feature Transform Feature Encoding Feature Evaluation Feature Engineering  Feature transform/selection  Feature embedding
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Machine learning – Model Training Model Training Feature Model Evaluation Model Validation Model Staging Model Training  Traditional machine learning models – Logistic Regression – Gradient boosting tree – Recommendation/ALS – LDA  Libraries – Apache Spark MLlib – XGBoost  Deep learning models – DNN – CNN – RNN – LSTM  Libraries – TensorFlow – Apache MXNet – BigDL
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Model Training - Deep learning can’t fit all  Natural language processing  Computer vision  Speech/Video  Anti-fraud  Recommendation  CTR estimation  Topic model  PageRank
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Machine learning – Model Serving Experiment Online Feature Model Database Exper- iment Model as Service Real-time Feature Calibration Online Service  Model deploy  Model serving – Batch – Streaming  Experiment – offline – online (A/B test)
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved How Apache Hadoop YARN boosts machine learning workloads
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Machine learning platform Hadoop YARN HDFS AWS S3 RDBMS Spark MLlib XGBoost TensorFlow Zeppelin Hive/LLAP Spark SQL CPU GPU SSD
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Why all under YARN SLA! Monitoring! A normal YARN user Quotas! Isolation! Capacity Planning, Preemption, Reservation System. Time time services, Grafana, etc. Queues / Users quota, user access control. CPU / Memory, (WIP) GPU, FPGA, Network
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved All running on the same YARN platform LLAP 128 G 128 G 128 G 128 G 128 G LLAP LLAP 128 G 128 G GPUs
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved GPU support on YARN  Why?  GPU: Many cores to handle massive (but simple) computation tasks simultaneously: GPU CPU GPU Computation Intensive Other Without GPU support, researchers/engineers are almost impossible to wait job finish.
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Challenges of GPU support  Different levels of support – Take me to a machine where GPUs are available with Partitions / Node Labels. (Current status) – Take me to a machine where GPUs are available • give me a full device only to me for the lifetime of my container • give me multiple full devices only to me for the lifetime of my container • give me full device(s) only to me for a portion of the lifetime of my container • give me a slice of device(s) to me for a full / portion of the lifetime of my container  More dimensions: – Bandwidths and on-GPU memory – Topology of multiple GPUs Slide credit to: Vinod Kumar Vavilapalli
  • 19. YARN assembly: Makes everything easier! Forget about writing an application master; this is how you can run an app on YARN: write the assembly spec in JSON (we call it a Yarnfile); post the JSON as a REST request to the YARN server; YARN figures out the rest. An example:
  • 20. YARN assembly: Run multi-stage jobs
  • 21. YARN assembly: Run Distributed TensorFlow Training (with PS)
  • 22. YARN Assembly: Parallel Parameter Tuning
  • 23. YARN Assembly: Model Serving and Update. Application & services upgrades: “Do an upgrade of my TensorFlow serving model with minimal impact to end-users”. Use serving.tensorflow-mode-serving.wtan.domain:1234 to access the service. YARN could do load balancing for launched instances.
  • 24. Example walkthrough: How to do Click-Through-Rate on a big data platform
  • 25. Click-Through Rate (CTR) Prediction. Given a user and context, predict the probability of a click for an ad. Probably the most “profitable” machine learning problem in industry. The basic setting is quite well studied; scale makes it challenging (Google, Facebook, Yahoo, Bing). Challenges: a simple binary problem, but we want probabilities, not just the label; very skewed label distribution (clicks << skips); tons of data (every impression generates a training example); limitations at serving (need to predict quickly).
  • 26. Labeled events. [Table, shown in three panels] Impression0: click; Impression1: non-click; Impression2: non-click; … Impression10: non-click; Impression11: click; Impression12: non-click; … Impression100: non-click; Impression101: non-click; Impression102: non-click; …
  • 27. CTR model. Logistic regression (LR): LR on SGD/LBFGS (batch); Follow the regularized leader, FTRL (online). Factorization Machines (FM/FFM). Gradient boosting tree (GBT). Deep neural networks (DNN).
  • 28. Questions?
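
To make the "LR on SGD" entry of the CTR model slide concrete, here is a minimal sketch of logistic regression trained with stochastic gradient descent on sparse binary features, returning a click probability rather than just a label. It is an illustration only, not the speakers' implementation: the field names (`ad`, `site`), the toy events, the learning rate, the epoch count, and the absence of a bias term or regularization are all assumptions made for the example.

```python
import math
from collections import defaultdict


def one_hot(event):
    """Turn categorical fields into sparse binary feature names, e.g. 'ad=shoes'."""
    return [f"{field}={value}" for field, value in event.items()]


def predict(weights, features):
    """Return the predicted click probability for one impression."""
    z = sum(weights[f] for f in features)
    z = max(-35.0, min(35.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(examples, epochs=50, learning_rate=0.1):
    """Plain SGD on log loss; no bias term and no regularization (assumption)."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for event, label in examples:
            features = one_hot(event)
            # Gradient of log loss w.r.t. each active binary feature is (p - y).
            error = predict(weights, features) - label
            for f in features:
                weights[f] -= learning_rate * error
    return weights


# Hypothetical labeled events: 1 = click, 0 = non-click.
examples = [
    ({"ad": "shoes", "site": "sports"}, 1),
    ({"ad": "shoes", "site": "news"}, 0),
    ({"ad": "loans", "site": "news"}, 0),
    ({"ad": "loans", "site": "sports"}, 0),
]

weights = train_sgd(examples)
p_click = predict(weights, one_hot({"ad": "shoes", "site": "sports"}))
p_skip = predict(weights, one_hot({"ad": "loans", "site": "news"}))
print(f"shoes/sports: {p_click:.3f}  loans/news: {p_skip:.3f}")
```

Because each impression touches only a handful of weights, one update is cheap, which is part of why linear models stay attractive at this scale. An online method such as FTRL would replace the update rule but keep the same sparse prediction step.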

Editor's Notes

  1. Data is flooding into every business. In many applications, more training data and bigger models mean better results. We use Hadoop to store large amounts of data and Spark on YARN for simple data processing, and we can also try machine learning frameworks such as TensorFlow or XGBoost on the Hadoop-based big data platform for machine learning or deep learning.
  2. Another important change is in the roles involved in machine learning. As datasets grow and problems become more complex, one person can’t do all of the work; data scientists need to work together with software engineers. Data scientists usually explore the data and find the best machine learning pipeline. After that, software engineers deploy the model and make predictions on new input, which can be batch data or streaming data.
  3. This is a typical machine learning workflow, which involves three steps: feature engineering, model training and online service. Not surprisingly, the most important thing is to have the right features: those capturing historical information dominate other types of features. Once we have the right features and the right model, other factors play small roles. We first derive a feature representation from raw data, then feed these features into a machine learning model, and then evaluate the candidate models and choose the best one to push to the online service. The workflow is complicated and usually involves several steps supported by several infrastructure components.
  4. As the workflow shows, only a tiny fraction of the code is actually devoted to model learning. The machine learning workflow usually needs a lot of support from the big data platform, such as data collection from different data sources, feature extraction, feature transformation, and so on. Let’s go step by step through how big data infrastructure can help machine learning.
  5. The machine learning workflow starts with loading data from different data sources, such as HDFS, AWS S3 or a database system. After that, we usually join data from the different sources to generate a wide table. Apache Hive and Apache Spark are the most appropriate tools for this workload. Data scientists then start data exploration via Zeppelin. The most common issue is an unbalanced label distribution in the dataset, for example far more positive labels than negative ones. To get a more accurate model, we subsample the group that has more instances to make the dataset balanced. After that, we randomly split the dataset into training and test sets with the help of Spark. Once we have training data, we can start feature engineering.
  6. Feature engineering has made great progress over the past decade, from hand-designed features to automated feature discovery by deep learning. In many cases, hand-designed features can leverage domain knowledge and lead to optimal results, and Spark MLlib provides many feature transform/selection operators to make this simple. But it involves heavy manual work and requires hiring experienced engineers. In recent years, DNNs have been applied successfully in computer vision, speech recognition and natural language processing. DNNs can learn features automatically via embedding; the most famous embedding technique is word2vec, which produces a vector space in which each unique word in the corpus is assigned a corresponding vector.
  7. Model training is the most important step of the whole pipeline.
  8. Deep learning is becoming more and more powerful, but it can’t solve all of humanity’s problems. In natural language processing, computer vision, and speech or video recognition, deep learning may perform better than traditional models. But for problems like recommendation or CTR estimation, very scalable linear models still play a major role. And for graph-related models such as topic models or PageRank, we still need a graph computation engine. Furthermore, hybrid models are becoming more and more useful. For example, Facebook presented a hybrid model structure: the concatenation of boosted decision trees and a probabilistic sparse linear classifier, illustrated in the figure. Their experience shows that this hybrid structure significantly increases prediction accuracy. Google also developed a hybrid model, the wide and deep learning model, which jointly trains a wide linear model (for memorization) alongside a deep neural network (for generalization) to combine the strengths of both. From these cases, we can learn that a machine learning platform should support both traditional machine learning models and deep learning models; both are very useful.
  9. Deploy the model in a distributed fashion for parallel model serving in batch mode or streaming mode. Evaluate the model offline or online using different metrics.
  10. Predicting Ad CTR is a massive-scale machine learning problem that is central to the multi-billion dollar online advertising industry. A typical CTR prediction problem shares some similarities with many other industry machine learning problems, which makes it very representative.
  11. Usually there are billions of ad impressions daily. Each impression has a unique ID. We need to join impressions with the click stream every x minutes to build the dataset for machine learning.
  12. Each model has advantages and disadvantages. Non-linear models can exploit different feature combinations and thus could potentially improve estimation performance, but they can’t scale to a large number of parameters. Deep neural networks (DNNs) can extract hidden structures and intrinsic patterns at different levels of abstraction from training data. But training deep neural networks on a large input feature space requires tuning a huge number of parameters, which is computationally expensive. In addition, the raw input features are high-dimensional, sparse binary features converted from raw categorical features, which makes it hard to train traditional DNNs at large scale.
  13. The features for CTR prediction are drawn from a variety of sources, including the query, the text of the ad creative, and various ad- or user-related metadata. The data is then fed into the pipeline, and the resulting model is pushed to the online service.
  14. Data scientist A clicks the export button and saves the notebook to a file system or cloud storage; data scientist B can then load the notebook from another web browser, easily re-run it, and help tune parameters.
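
The preprocessing steps described in notes 5 and 11 (join impressions with the click stream, subsample the larger class, then split randomly into training and test sets) can be sketched in a few lines. This is a single-machine illustration in plain Python under stated assumptions; on the platform described in the talk the same steps would run on Hive or Spark. The impression IDs, the 1:1 balancing ratio, the 80/20 split and all function names are hypothetical.

```python
import random


def label_impressions(impression_ids, clicked_ids):
    """Join impressions with the click stream: label 1 if clicked, else 0."""
    clicked = set(clicked_ids)
    return [(imp, 1 if imp in clicked else 0) for imp in impression_ids]


def downsample_majority(rows, seed=0):
    """Subsample the larger class so both classes have the same size."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[1] == 1]
    negatives = [r for r in rows if r[1] == 0]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


def random_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data: 100 impressions, 5 of which were clicked.
impressions = [f"imp{i}" for i in range(100)]
clicks = ["imp0", "imp11", "imp42", "imp77", "imp99"]

labeled = label_impressions(impressions, clicks)
balanced = downsample_majority(labeled)
train, test = random_split(balanced, test_fraction=0.2)
print(len(labeled), len(balanced), len(train), len(test))
```

Balancing to exactly 1:1 is only one possible choice; a milder subsampling rate keeps more data at the cost of a more skewed label distribution.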