Deep learning is a critical capability for gaining intelligence from datasets. Many existing frameworks require a separate cluster for deep learning, and multiple programs have to be created for a typical machine learning pipeline. Separate clusters require large datasets to be transferred between them, and introduce unwanted system complexity and end-to-end learning latency.
Yahoo introduced CaffeOnSpark to alleviate those pain points and bring deep learning onto Hadoop and Spark clusters. By combining salient features of the deep learning framework Caffe and the big-data framework Apache Spark, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. The framework is complementary to non-deep-learning libraries such as MLlib and Spark SQL, and its DataFrame-style API provides Spark applications with an easy mechanism to invoke deep learning over distributed datasets. Its server-to-server direct communication (Ethernet or InfiniBand) achieves faster learning and eliminates scalability bottlenecks.
Recently, we released CaffeOnSpark at github.com/yahoo/CaffeOnSpark under the Apache 2.0 License. In this talk, we will provide a technical overview of CaffeOnSpark, its API, and its deployment on a private or public cloud (AWS EC2). An IPython notebook demo will also be given to show how CaffeOnSpark works with other Spark packages (ex. MLlib).
Speakers:
Andy Feng is a VP Architecture at Yahoo, leading the architecture and design of big data and machine learning initiatives. He has architected major platforms for personalization, ads serving, NoSQL, and cloud infrastructure.
Jun Shi is a Principal Engineer at Yahoo who specializes in machine learning platforms and large-scale machine learning algorithms. Prior to Yahoo, he was designing wireless communication chips at Broadcom, Qualcomm and Intel.
Mridul Jain is a Senior Principal at Yahoo, focusing on machine learning and big data platforms (especially realtime processing). He has worked on trending algorithms for search, unstructured content extraction, and realtime processing for a central monitoring platform, and is the co-author of Pig on Storm.
3. CaffeOnSpark Open Sourced
github.com/yahoo/CaffeOnSpark
› Apache 2.0 license
› Distributed deep learning
  › GPU or CPU servers
  › Ethernet or InfiniBand connection
› Easily deployed on public cloud (ex. EC2) or private cloud
4. Agenda
Deep Learning
CaffeOnSpark Overview
› Distributed DL made easy
CaffeOnSpark EC2 demo
› Launch, CLI and IPython
6. DL Use Case: Flickr Magic View
• Released with Flickr 4.0
• https://flickr.com/cameraroll
• Photos organized according to 70 categories
• Allows serendipitous photo discovery
7. Flickr DL/ML Pipeline
(1) Prepare Datasets @ Scale
(2) Deep Learning @ Scale
(3) Non-deep Learning @ Scale
(4) Apply ML Model @ Scale
By Ray Boden
* http://bit.ly/1KIDfof by Pierre Garrigues, Deep Learning Summit 2015
13. Data Formats: LMDB & DataFrame
• LMDB … Lightning Memory-Mapped Database
• DataFrame … Distributed collection of data in named columns

Training/test dataframe:
id | label | data
1  | cat   |
2  | cat   |
3  | dog   |

Feature dataframe:
id | label | fn1    | fn2
1  | cat   | [1.5]  | [0.1, 2.1]
2  | cat   | [1.7]  | [0.2, 1.0]
3  | dog   | [11.9] | [22.0, 2.0]
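The two table shapes above can be sketched in plain Scala without a Spark dependency. `TrainRow`, `FeatureRow`, and `extractFeatures` are illustrative names, not CaffeOnSpark API; the fabricated vectors stand in for the output of a trained Caffe net.

```scala
// Plain-Scala sketch of the two table shapes above. TrainRow/FeatureRow and
// extractFeatures are illustrative; a real pipeline would run the trained
// Caffe net over the image bytes instead of fabricating vectors.
case class TrainRow(id: Int, label: String, data: Array[Byte])
case class FeatureRow(id: Int, label: String, fn1: Seq[Double], fn2: Seq[Double])

// Stand-in for DL feature extraction: id and label pass through unchanged,
// while the image bytes are replaced by feature vectors.
def extractFeatures(r: TrainRow): FeatureRow =
  FeatureRow(r.id, r.label, Seq(r.data.length.toDouble), Seq(0.0, 1.0))

val train = Seq(
  TrainRow(1, "cat", Array[Byte](1, 2, 3)),
  TrainRow(2, "cat", Array[Byte](4, 5)),
  TrainRow(3, "dog", Array[Byte](6))
)
val features = train.map(extractFeatures)
features.foreach(println)
```

Note how the feature table keeps the fixed id/label columns while replacing the raw data column with feature vectors, which is what lets the non-deep-learning stage consume it directly.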
14. CaffeOnSpark: One Program (Scala) http://bit.ly/21ZY1c2

// Deep Learning
cos = new CaffeOnSpark(ctx)
conf = new Config(ctx, args).init()
dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source) // training DL model
lr_raw_source = DataSource.getSource(conf, false)
ext_df = cos.features(lr_raw_source) // extract features via DL

// Non-deep Learning
lr_input = ext_df.withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
                 .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
lr_model = lr.fit(lr_input)
…
20. Summary
CaffeOnSpark open sourced
› Empowers Flickr and other Yahoo services
› Scalable DL made easy
› https://github.com/yahoo/CaffeOnSpark
Editor's Notes
To enable approximate computing, we are building machine learning on top of Hadoop, Spark and our machine learning servers.
These servers are a YARN application, specifically designed for machine learning.
All data are stored in memory with customized stores. These stores enable lockless concurrency and can handle millions of operations per second.
Our servers are implemented in Java but create zero garbage. This lets us run training consistently at high throughput without worrying about garbage collection.
Our API supports asynchronous machine learning and mini-batches. This ensures very fast training by many learners.
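To make the mini-batch idea in this note concrete, here is a hedged sketch of mini-batch gradient descent for a one-parameter linear model in plain Scala. It only illustrates the technique; it is not the CaffeOnSpark or ML-server implementation, and the data, step size, and batch size are invented.

```scala
// Mini-batch gradient descent for y = w * x, plain Scala.
// Each epoch walks the dataset in batches of 10 and updates w once per batch.
val data: Seq[(Double, Double)] = (1 to 100).map { i =>
  val x = i / 100.0
  (x, 3.0 * x) // the true weight is 3.0
}

var w = 0.0 // model parameter, learned below
val stepSize = 0.5
val batchSize = 10

for (epoch <- 1 to 50; batch <- data.grouped(batchSize)) {
  // gradient of mean squared error over this mini-batch
  val grad = batch.map { case (x, y) => 2.0 * (w * x - y) * x }.sum / batch.size
  w -= stepSize * grad
}
println(f"learned w = $w%.3f") // converges to the true weight 3.0
```

Updating per batch rather than per full pass is what lets many learners make progress quickly; asynchronous training extends this by letting learners apply such updates without waiting for each other.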
To minimize data movement, we enable clients to move computing logic to the servers. For example, we support MapReduce operations on the servers.
As an example, you may want to perform statistical analysis of large models using MapReduce operations.
Our servers provide built-in support for Hadoop file systems. You can store your models after each training run, and load previously trained models from HDFS.
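The checkpoint pattern in this note (save weights after each training run, reload them later) can be sketched in plain Scala with `java.io` against the local filesystem; the real system described here would use Hadoop's FileSystem API against HDFS instead, and `saveWeights`/`loadWeights` are hypothetical helper names.

```scala
// Hedged sketch: checkpoint model weights to a binary file and reload them.
// Local filesystem stand-in for HDFS; helper names are illustrative.
import java.io._

def saveWeights(path: String, w: Array[Double]): Unit = {
  val out = new DataOutputStream(new FileOutputStream(path))
  try { out.writeInt(w.length); w.foreach(out.writeDouble) }
  finally out.close()
}

def loadWeights(path: String): Array[Double] = {
  val in = new DataInputStream(new FileInputStream(path))
  try { Array.fill(in.readInt())(in.readDouble()) }
  finally in.close()
}

val weights = Array(0.5, -1.25, 3.0)
saveWeights("model.bin", weights)
val restored = loadWeights("model.bin")
```

Writing a length prefix before the values keeps the file self-describing, so a reader does not need external metadata to reconstruct the array.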
We released the magic view as part of the Flickr 4.0 release last April, and this is the most visible user-facing feature that exposes our image recognition capabilities. Our users can switch from the traditional timeline view of their photos to an experience where their photos are arranged according to 70 categories. For example, you can see here that landscape photos are sub-categorized into different types such as mountain, rock, or shore.
This is a great feature for serendipitous photo discovery. Most of us have thousands of photos that we don’t get to see very often but are emotionally very attached to, and these types of groupings help us re-discover photos.
At Yahoo, our data scientists are applying big-data machine learning on Hadoop clusters daily.
Here is a screenshot from one Hadoop cluster.
In addition to various MapReduce jobs, we have a Spark job for machine learning, and a ML server for managing data of ML models.
DataFrame is a distributed collection of data organized into named columns.
Input: Image DataFrames
› Required columns: label, data
› Optional columns: id, channels, height, width, encoded
› Mapping via configuration, ex. "sampleId as id", "round(h)+1 as height"
Output: Feature DataFrames
› 2 fixed columns: id, label
› Variable columns: features
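The configuration-driven column mapping described above (ex. "sampleId as id") can be sketched in plain Scala over rows represented as maps. The mapping table and `remap` helper are illustrative, and this sketch handles only simple renames, not expressions like round(h)+1.

```scala
// Plain-Scala sketch of configuration-driven column renaming.
// Rows are Map[String, Any]; unmapped columns pass through unchanged.
type Row = Map[String, Any]

// input column name -> required column name (illustrative mapping)
val mapping = Map("sampleId" -> "id", "h" -> "height")

def remap(row: Row): Row =
  row.map { case (k, v) => mapping.getOrElse(k, k) -> v }

val input: Row = Map("sampleId" -> 42, "label" -> "cat", "h" -> 224)
val output = remap(input)
println(output)
```

After remapping, `sampleId` appears as the required `id` column and `h` as `height`, while `label` is untouched, which is how arbitrary input schemas can be adapted to the required columns without rewriting the data.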
In summary, Yahoo has made significant progress on scalable machine learning.
We conduct daily training with billions of signals for critical businesses such as search and advertising.
Hadoop and YARN are playing a central role in this evolution. On our YARN clusters, we built a framework for approximate computing.
We are currently exploring both GPU and CPU in a single cluster.