In 2013, I talked about Yahoo’s adoption of Storm for low-latency processing.
Last year, I described Yahoo’s effort to bring Spark onto our YARN clusters.
Today, we should cover our progress on machine learning using YARN clusters.
I will cover three areas: WHY Yahoo applies machine learning, WHAT challenges we try to address, and HOW we address them.
I will wrap the talk with key lessons learned from our experience.
At last year’s Hadoop Summit, we discussed how Hadoop clusters have become the preferred platform for large-scale machine learning at Yahoo. Recently, we introduced distributed deep learning as a new capability of Hadoop clusters. These new clusters augment our existing CPU nodes and Ethernet connectivity with GPU nodes and InfiniBand connectivity. We developed a distributed deep learning solution, CaffeOnSpark, based on Apache Spark and Caffe from UC Berkeley. CaffeOnSpark enables deep learning tasks to be launched via the spark-submit command, as in any Spark application. Given a partition of HDFS-based training data, each Spark executor launches Caffe-based training threads to train deep neural network models. After back-propagation processing of a batch of training examples, CaffeOnSpark training threads exchange the gradients of model parameters across all GPUs on multiple servers. In this talk, we will provide a technical overview of CaffeOnSpark, and explain how CaffeOnSpark conducts deep learning in a private cloud or a public cloud (such as AWS EC2). We will share our experience at Yahoo through use cases (including photo auto-tagging), and discuss areas of collaboration with open source communities on Hadoop-based deep learning.
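The gradient exchange described above can be sketched in a few lines. This is a toy stand-in, not CaffeOnSpark’s actual MPI/RDMA implementation: after back-propagation over a batch, each worker holds a local gradient, and the workers average those gradients so every model replica applies the same update.

```python
# Toy sketch of synchronous gradient exchange across workers
# (illustrative only; function names are not from CaffeOnSpark).

def allreduce_average(gradients):
    """Element-wise average of per-worker gradient vectors."""
    n = len(gradients)
    dim = len(gradients[0])
    return [sum(g[k] for g in gradients) / n for k in range(dim)]

def sgd_step(weights, avg_grad, lr=0.1):
    """Apply the averaged gradient; all replicas stay in sync."""
    return [w - lr * g for w, g in zip(weights, avg_grad)]

# Three workers, each with a local gradient for a two-parameter model.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
avg = allreduce_average(grads)        # [3.0, 4.0]
new_weights = sgd_step([0.5, 0.5], avg)
```

Because every worker applies the same averaged gradient, the model replicas remain identical after each step, which is what makes the synchronous exchange correct.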
Deep learning is a branch of machine learning, itself a branch of artificial intelligence. It attempts to model high-level abstractions in data by using multiple processing layers.
A deep neural network has multiple hidden layers of units between the input and output layers.
* ImageNet competition 2014 … GoogLeNet with 22 layers
* ILSVRC competition 2015 … Microsoft with 152 layers
Many of these deep networks have millions or even billions of parameters. To learn these parameters from data, we go through many iterations of forward prediction and back propagation over these networks.
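As a minimal illustration of that loop, here is a one-parameter model y ≈ w·x trained by repeated forward prediction and back propagation. This is a toy sketch; real networks repeat the same cycle over millions of parameters and many layers.

```python
# Toy forward/backward training loop: model y_hat = w * x,
# squared-error loss (y_hat - y)^2.

def train(samples, w=0.0, lr=0.1, iterations=50):
    for _ in range(iterations):
        for x, y in samples:
            y_hat = w * x                  # forward prediction
            grad = 2.0 * (y_hat - y) * x   # back propagation: d/dw of (y_hat - y)^2
            w -= lr * grad                 # gradient-descent update
    return w

# Samples generated by y = 3x; many iterations recover w close to 3.
w = train([(1.0, 3.0), (2.0, 6.0)])
```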
We released the Magic View as part of the Flickr 4.0 release last April, and it is the most visible user-facing feature that exposes our image recognition capabilities. Our users can switch from the traditional timeline view of their photos to an experience where their photos are arranged into 70 categories. For example, you can see here that landscape photos are sub-categorized into different types such as mountain, rock, or shore.
This is a great feature for serendipitous photo discovery. Most of us have thousands of photos that we don’t get to see very often but are emotionally very attached to, and these types of groupings help us re-discover photos.
To enable approximate computing, we are building machine learning on top of Hadoop, Spark, and our machine learning servers.
These servers run as a YARN application specifically designed for machine learning. All data are stored in memory with customized stores. These stores enable lock-free concurrency and can handle millions of operations per second.
Our servers are implemented in Java but create zero garbage. This enables us to run training consistently with high throughput, without worrying about garbage collection pauses.
Our API supports asynchronous machine learning and mini-batch training. This enables very fast training by many concurrent learners.
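The asynchronous, mini-batch style of training can be sketched as follows. This is a hedged illustration, not Yahoo’s actual server API: many learner threads push updates into a shared in-memory parameter store without waiting on one another.

```python
# Sketch of asynchronous mini-batch updates against a shared in-memory
# store (class and function names are hypothetical, for illustration).

import threading

class ParameterStore:
    """In-memory model store; learners read and update it concurrently."""
    def __init__(self, dim):
        self.weights = [0.0] * dim

    def apply_gradient(self, grad, lr=0.01):
        for k, g in enumerate(grad):
            self.weights[k] -= lr * g   # asynchronous: no global lock

def learner(store, mini_batches):
    for batch in mini_batches:
        # In real training the gradient comes from back-propagation over
        # the mini-batch; here the "gradient" is the batch itself.
        store.apply_gradient(batch)

store = ParameterStore(dim=2)
threads = [threading.Thread(target=learner, args=(store, [[1.0, -1.0]] * 100))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because updates are applied without coordination, individual updates may race; asynchronous training tolerates this noise in exchange for much higher throughput.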
To minimize data movement, we enable clients to move computing logic to the servers. For example, we enable MapReduce operations on the servers. As an example, you may want to perform statistical analysis of large models using MapReduce operations.
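The idea of running a map-reduce style computation where the model lives, instead of shipping the model to the client, can be sketched like this (function names are hypothetical, and each “shard” stands in for one server’s slice of the model):

```python
# Sketch: compute statistics over model shards in place; only the small
# per-shard results cross the network, never the shards themselves.

def server_side_map_reduce(shards, map_fn, reduce_fn):
    partials = [map_fn(shard) for shard in shards]   # map: runs on each server
    result = partials[0]
    for p in partials[1:]:
        result = reduce_fn(result, p)                # reduce: combine small results
    return result

# Model parameters sharded across three servers.
shards = [[0.5, -1.2, 3.0], [2.0, 0.0], [-0.7, 1.1, 0.4, 0.9]]

# Count parameters and accumulate the sum of squares.
stats = server_side_map_reduce(
    shards,
    map_fn=lambda shard: (len(shard), sum(w * w for w in shard)),
    reduce_fn=lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
```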
Our servers provide built-in support for Hadoop file systems. You can store your models after each training run, and load previously trained models from HDFS.
Flickr DL/ML Pipeline
* http://bit.ly/1KIDfof by Pierre Garrigues, Deep Learning Summit 2015
* 10 billion photos
* 7.5 million per day
Hadoop Cluster Enhanced
GPU servers added
› 4 Tesla K80 cards
• 2 GK210 GPUs, 24GB memory
Network interface enhanced
› InfiniBand for direct access to GPU
› Ethernet for external communication
Deep Learning Frameworks
Caffe
› Available since Sept. 2013, 6.3k forks
› Popular in vision community & Yahoo
TensorFlow
› Released in Nov. 2015, 9.8k forks
Theano, Torch, DL4J, etc.
Released in Feb. 2016
• Apache 2.0 license
• Distributed deep learning
– GPU or CPU
– Ethernet or InfiniBand
• Easily deployed on public cloud or private cloud
CaffeOnSpark Open Sourced
CaffeOnSpark: One Program (Scala)
cos = new CaffeOnSpark(ctx)
conf = new Config(ctx, args).init()
// (1) training DL model
dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source)
// (2) extract features via DL
lr_raw_source = DataSource.getSource(conf, false)
ext_df = cos.features(lr_raw_source)
// (3) apply ML
lr_input_df = ext_df.withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
                    .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
lr_model = lr.fit(lr_input_df)
CaffeOnSpark: One Notebook (Python)
Demo: CaffeOnSpark on EC2
› Get started on EC2
› Python for CaffeOnSpark
CaffeOnSpark: What’s Next?
Validation within training
Enhanced data layer
RNN and LSTM
Asynchronous distributed training
Related Work: SparkNet & DL4J
1) [driver] sc.broadcast(model) to executors
2) [executor] apply DL training against a mini-batch of the dataset to update models locally
3) [driver] aggregate(models) to produce a new model
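The three-step broadcast / local-train / aggregate loop above can be sketched in pure Python (a stand-in with no Spark; the per-example “gradient” rule is made up for illustration):

```python
# Sketch of one model-averaging round, as in SparkNet/DL4J:
# the driver broadcasts the model, each executor trains on its local
# mini-batch, and the driver averages the resulting models.

def local_train(model, mini_batch, lr=0.1):
    """Stand-in for a round of DL training on one executor: pretend
    each example's gradient simply pulls the model toward the example."""
    for x in mini_batch:
        model = [w - lr * (w - xi) for w, xi in zip(model, x)]
    return model

def model_averaging_round(model, partitions):
    replicas = [local_train(list(model), p) for p in partitions]  # steps (1)+(2)
    n = len(replicas)
    return [sum(r[k] for r in replicas) / n                        # step (3)
            for k in range(len(model))]

partitions = [[[1.0, 1.0]], [[3.0, 3.0]]]   # one mini-batch per executor
model = model_averaging_round([0.0, 0.0], partitions)
```

Averaging whole models only once per round keeps communication cheap, but the replicas drift apart between rounds; that is the main trade-off versus per-batch gradient exchange.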
Yahoo Hadoop clusters enhanced for deep learning
› GPU nodes + CPU nodes
› InfiniBand network for fast communication
CaffeOnSpark open sourced
› Empower Flickr and other Yahoo services
• In production since Q3 2015
• Reduced training latency, and improved accuracy
› Scalable deep learning made easy
• spark-submit on your Spark cluster