Deep learning is a critical capability for gaining intelligence from datasets. Many existing frameworks require a separate cluster for deep learning, and multiple programs have to be created for a typical machine learning pipeline. Separate clusters require large datasets to be transferred between them, and introduce unwanted system complexity and end-to-end learning latency.
Yahoo introduced CaffeOnSpark to alleviate those pain points and bring deep learning onto Hadoop and Spark clusters. By combining salient features of the deep learning framework Caffe and the big-data framework Apache Spark, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. The framework is complementary to non-deep-learning libraries such as MLlib and Spark SQL, and its DataFrame-style API provides Spark applications with an easy mechanism to invoke deep learning over distributed datasets. Its server-to-server direct communication (Ethernet or InfiniBand) achieves faster learning and eliminates scalability bottlenecks.
Recently, we released CaffeOnSpark at github.com/yahoo/CaffeOnSpark under the Apache 2.0 License. In this talk, we will provide a technical overview of CaffeOnSpark, its API, and its deployment on a private or public cloud (AWS EC2). An IPython notebook demo will also be given to show how CaffeOnSpark works with other Spark packages (ex. MLlib).
Speakers:
Andy Feng is a VP Architecture at Yahoo, leading the architecture and design of big data and machine learning initiatives. He has architected major platforms for personalization, ads serving, NoSQL, and cloud infrastructure.
Jun Shi is a Principal Engineer at Yahoo who specializes in machine learning platforms and large-scale machine learning algorithms. Prior to Yahoo, he was designing wireless communication chips at Broadcom, Qualcomm and Intel.
Mridul Jain is a Senior Principal at Yahoo, focusing on machine learning and big data platforms (especially realtime processing). He has worked on trending algorithms for search, unstructured content extraction, and realtime processing for a central monitoring platform, and is the co-author of Pig on Storm.
3. CaffeOnSpark Open Sourced
github.com/yahoo/CaffeOnSpark
› Apache 2.0 license
› Distributed deep learning
  › GPU or CPU servers
  › Ethernet or InfiniBand connection
› Easily deployed on public cloud (ex. EC2) or private cloud
4. Agenda
Deep Learning
CaffeOnSpark Overview
› Distributed DL made easy
CaffeOnSpark EC2 demo
› Launch, CLI and IPython
6. DL Use Case: Flickr Magic View
• Released with Flickr 4.0
• https://flickr.com/cameraroll
• Photos organized according to 70 categories
• Allows serendipitous photo discovery
7. Flickr DL/ML Pipeline
(1) Prepare Datasets @ Scale
(2) Deep Learning @ Scale
(3) Non-deep Learning @ Scale
(4) Apply ML Model @ Scale
By Ray Boden
* http://bit.ly/1KIDfof by Pierre Garrigues, Deep Learning Summit 2015
13. Data Formats: LMDB & DataFrame
• LMDB … Lightning Memory-Mapped Database
• DataFrame … Distributed collection of data in named columns

Training/test dataframe:
id | label | data
1  | cat   |
2  | cat   |
3  | dog   |

Feature dataframe:
id | label | fn1    | fn2
1  | cat   | [1.5]  | [0.1, 2.1]
2  | cat   | [1.7]  | [0.2, 1.0]
3  | dog   | [11.9] | [22.0, 2.0]
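The two table shapes above can be sketched in plain Scala without a Spark dependency. `TrainRow`, `FeatureRow`, and `extractFeatures` are illustrative names, not CaffeOnSpark API; the fabricated vectors stand in for the output of a trained Caffe net.

```scala
// Plain-Scala sketch of the two table shapes above. TrainRow/FeatureRow and
// extractFeatures are illustrative; a real pipeline would run the trained
// Caffe net over the image bytes instead of fabricating vectors.
case class TrainRow(id: Int, label: String, data: Array[Byte])
case class FeatureRow(id: Int, label: String, fn1: Seq[Double], fn2: Seq[Double])

// Stand-in for DL feature extraction: id and label pass through unchanged,
// while the image bytes are replaced by feature vectors.
def extractFeatures(r: TrainRow): FeatureRow =
  FeatureRow(r.id, r.label, Seq(r.data.length.toDouble), Seq(0.0, 1.0))

val train = Seq(
  TrainRow(1, "cat", Array[Byte](1, 2, 3)),
  TrainRow(2, "cat", Array[Byte](4, 5)),
  TrainRow(3, "dog", Array[Byte](6))
)
val features = train.map(extractFeatures)
features.foreach(println)
```

Note how the feature table keeps the fixed id/label columns while replacing the raw data column with feature vectors, which is what lets the non-deep-learning stage consume it directly.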
14. CaffeOnSpark: One Program (Scala) http://bit.ly/21ZY1c2

// Deep Learning
cos = new CaffeOnSpark(ctx)
conf = new Config(ctx, args).init()
dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source) // training DL model
lr_raw_source = DataSource.getSource(conf, false)
ext_df = cos.features(lr_raw_source) // extract features via DL

// Non-deep Learning
lr_input = ext_df.withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
                 .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
lr_model = lr.fit(lr_input)
…
20. Summary
CaffeOnSpark open sourced
› Empowers Flickr and other Yahoo services
› Scalable DL made easy
› https://github.com/yahoo/CaffeOnSpark
Editor's Notes
To enable approximate computing, we are building machine learning on top of Hadoop, Spark and our machine learning servers.
These servers are a YARN application, specifically designed for machine learning.
All data are stored in memory with customized stores. These stores enable lockless concurrency and can handle millions of operations per second.
Our servers are implemented in Java but create zero garbage. This lets us run training consistently at high throughput without worrying about garbage collection.
Our API supports asynchronous machine learning and mini-batches. This ensures very fast training by many learners.
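To make the mini-batch idea in this note concrete, here is a hedged sketch of mini-batch gradient descent for a one-parameter linear model in plain Scala. It only illustrates the technique; it is not the CaffeOnSpark or ML-server implementation, and the data, step size, and batch size are invented.

```scala
// Mini-batch gradient descent for y = w * x, plain Scala.
// Each epoch walks the dataset in batches of 10 and updates w once per batch.
val data: Seq[(Double, Double)] = (1 to 100).map { i =>
  val x = i / 100.0
  (x, 3.0 * x) // the true weight is 3.0
}

var w = 0.0 // model parameter, learned below
val stepSize = 0.5
val batchSize = 10

for (epoch <- 1 to 50; batch <- data.grouped(batchSize)) {
  // gradient of mean squared error over this mini-batch
  val grad = batch.map { case (x, y) => 2.0 * (w * x - y) * x }.sum / batch.size
  w -= stepSize * grad
}
println(f"learned w = $w%.3f") // converges to the true weight 3.0
```

Updating per batch rather than per full pass is what lets many learners make progress quickly; asynchronous training extends this by letting learners apply such updates without waiting for each other.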
To minimize data movement, we enable clients to move computing logic to the servers. For example, we support MapReduce operations on the servers.
As an example, you may want to perform statistical analysis of large models using MapReduce operations.
Our servers provide built-in support for Hadoop file systems. You can store your models after each training run, and load previously trained models from HDFS.
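The checkpoint pattern in this note (save weights after each training run, reload them later) can be sketched in plain Scala with `java.io` against the local filesystem; the real system described here would use Hadoop's FileSystem API against HDFS instead, and `saveWeights`/`loadWeights` are hypothetical helper names.

```scala
// Hedged sketch: checkpoint model weights to a binary file and reload them.
// Local filesystem stand-in for HDFS; helper names are illustrative.
import java.io._

def saveWeights(path: String, w: Array[Double]): Unit = {
  val out = new DataOutputStream(new FileOutputStream(path))
  try { out.writeInt(w.length); w.foreach(out.writeDouble) }
  finally out.close()
}

def loadWeights(path: String): Array[Double] = {
  val in = new DataInputStream(new FileInputStream(path))
  try { Array.fill(in.readInt())(in.readDouble()) }
  finally in.close()
}

val weights = Array(0.5, -1.25, 3.0)
saveWeights("model.bin", weights)
val restored = loadWeights("model.bin")
```

Writing a length prefix before the values keeps the file self-describing, so a reader does not need external metadata to reconstruct the array.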
We released the magic view as part of the Flickr 4.0 release last April, and this is the most visible user-facing feature that exposes our image recognition capabilities. Our users can switch from the traditional timeline view of their photos to an experience where their photos are arranged according to 70 categories. For example, you can see here that landscape photos are sub-categorized into different types such as mountain, rock, or shore.
This is a great feature for serendipitous photo discovery. Most of us have thousands of photos that we don’t get to see very often but are emotionally very attached to, and these types of groupings help us re-discover photos.
At Yahoo, our data scientists are applying big-data machine learning on Hadoop clusters daily.
Here is a screenshot from one Hadoop cluster.
In addition to various MapReduce jobs, we have a Spark job for machine learning, and a ML server for managing data of ML models.
DataFrame is a distributed collection of data organized into named columns.
Input: Image DataFrames
› Required columns: label, data
› Optional columns: id, channels, height, width, encoded
› Mapping via configuration, ex. "sampleId as id", "round(h)+1 as height"
Output: Feature DataFrames
› 2 fixed columns: id, label
› Variable columns: features
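The configuration-driven column mapping described above (ex. "sampleId as id") can be sketched in plain Scala over rows represented as maps. The mapping table and `remap` helper are illustrative, and this sketch handles only simple renames, not expressions like round(h)+1.

```scala
// Plain-Scala sketch of configuration-driven column renaming.
// Rows are Map[String, Any]; unmapped columns pass through unchanged.
type Row = Map[String, Any]

// input column name -> required column name (illustrative mapping)
val mapping = Map("sampleId" -> "id", "h" -> "height")

def remap(row: Row): Row =
  row.map { case (k, v) => mapping.getOrElse(k, k) -> v }

val input: Row = Map("sampleId" -> 42, "label" -> "cat", "h" -> 224)
val output = remap(input)
println(output)
```

After remapping, `sampleId` appears as the required `id` column and `h` as `height`, while `label` is untouched, which is how arbitrary input schemas can be adapted to the required columns without rewriting the data.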
In summary, Yahoo has made significant progress on scalable machine learning.
We conduct daily training with billions of signals for critical businesses such as search and advertising.
Hadoop and YARN are playing a central role in this evolution. On our YARN clusters, we built a framework for approximate computing.
We are currently exploring both GPU and CPU in a single cluster.