TensorFlow™ is one of the most popular open-source projects for machine learning and deep learning, handling enterprise use cases such as image recognition, video analytics, and audio translation. However, training deep learning models is expensive and requires substantial GPU resources. Moreover, a real-life distributed TensorFlow application needs a set of services (workers, parameter servers, TensorBoard, etc.) working together, and these services must be carefully configured so they can talk to each other.
To make distributed TF applications easy to launch, manage, and monitor with YARN, we introduced the YARN service assembly along with other improvements such as GPU support, container DNS support, and scheduling enhancements. These improvements make running a distributed TF application on YARN as simple as running it locally, letting TF developers focus on deep learning algorithms instead of worrying about the underlying infrastructure. They also allow YARN to better manage a shared cluster that runs TF alongside other services and batch jobs.
During this session, we will take a closer look at these improvements, and we will demo a distributed TF assembly, consisting of workers, parameter servers, TensorBoard, and prediction servers, running on YARN.
Speaker:
Sunil Govindan, Senior Software Engineer, Hortonworks
Data is flooding into every business. In many applications, more training data and bigger models mean better results. We use Hadoop to store large amounts of data, use Spark on YARN for data processing, and can also run machine learning frameworks such as TensorFlow or XGBoost on the Hadoop-based big data platform for machine learning or deep learning.
Another important change is in the roles within machine learning. With growing datasets and increasingly complex problems, one person can no longer do all of the work; data scientists need to work together with software engineers. Data scientists usually explore the data and find the best machine learning pipeline. After that, software engineers deploy the model and make predictions on new input, which could be batch data or streaming data.
This is a typical machine learning workflow, which involves three steps: feature engineering, model training, and online serving. Not surprisingly, the most important thing is to have the right features: those capturing historical information dominate other types of features. Once we have the right features and the right model, other factors play small roles.
We first derive feature representations from raw data, then feed these features into a machine learning model, and finally evaluate the candidate models and push the best one into the online service.
The machine learning workflow is complicated; it usually involves several steps supported by several infrastructure components.
As the workflow shows, only a tiny fraction of the code is actually devoted to model learning. The machine learning workflow usually needs a lot of support from the big data platform, such as data collection from different data sources, feature extraction, feature transformation, and so on.
Let’s see how big data infrastructure can help machine learning, step by step.
The machine learning workflow starts with loading data from different data sources, such as HDFS, AWS S3, or a database system.
After that, we usually join data from different sources to generate a wide table. Apache Hive and Apache Spark are the most appropriate tools for this workload.
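To make the wide-table step concrete, here is a minimal pure-Python sketch of the join; in practice this would be a Hive or Spark SQL join at scale, and the field names (user_id, age, clicks) are hypothetical.

```python
# Sketch of the "wide table" join, in pure Python for illustration.
# In production this would be a Hive / Spark SQL join; the column
# names here are made up.

users = [
    {"user_id": 1, "age": 31},
    {"user_id": 2, "age": 25},
]
activity = [
    {"user_id": 1, "clicks": 12},
    {"user_id": 2, "clicks": 3},
]

def join_on_user_id(left, right):
    """Inner-join two lists of records on user_id into wide records."""
    right_by_id = {r["user_id"]: r for r in right}
    return [
        {**l, **right_by_id[l["user_id"]]}
        for l in left
        if l["user_id"] in right_by_id
    ]

wide = join_on_user_id(users, activity)
print(wide[0])  # {'user_id': 1, 'age': 31, 'clicks': 12}
```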
Then data scientists start data exploration via Zeppelin. The most common issue is an unbalanced label distribution in the dataset; for example, the number of positive labels may be far greater than the number of negative labels. To get a more accurate model, we subsample the class with more instances to make the dataset balanced.
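The balancing step above can be sketched as a simple downsampling of the majority class (in Spark this could be done with `sampleBy`); the data below is synthetic:

```python
# Sketch of balancing an unbalanced dataset by downsampling the
# majority class to the size of the minority class.
import random

random.seed(0)

# Synthetic labeled examples: far more positives than negatives.
positives = [("x%d" % i, 1) for i in range(1000)]
negatives = [("y%d" % i, 0) for i in range(100)]

def balance(majority, minority):
    """Downsample the majority class to match the minority class size."""
    return random.sample(majority, len(minority)) + minority

balanced = balance(positives, negatives)
print(len(balanced))  # 200: 100 positives + 100 negatives
```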
After that, we randomly split the dataset into training and test sets with the help of Spark.
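The random split can be sketched as follows; Spark provides `DataFrame.randomSplit` for the same purpose at scale:

```python
# Sketch of a random train/test split over placeholder rows.
import random

random.seed(42)

dataset = list(range(1000))  # placeholder rows

def random_split(rows, train_fraction=0.8):
    """Shuffle a copy of the rows and cut at the train fraction."""
    rows = rows[:]  # avoid mutating the caller's list
    random.shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

train, test = random_split(dataset)
print(len(train), len(test))  # 800 200
```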
Once we get training data, we can start feature engineering.
Feature engineering has made great progress over the past decade, moving from hand-designed features to automated feature discovery with deep learning.
In many cases, hand-designed features can leverage domain knowledge and lead to excellent results, and Spark MLlib provides many feature transformation and selection operators to make this simple and easy. But it involves heavy manual work and requires hiring experienced engineers.
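As a concrete example of a hand-designed transform, here is a pure-Python sketch of bucketizing a numeric column, the kind of operator Spark MLlib offers as `Bucketizer`; the age boundaries are made up:

```python
# Sketch of a hand-designed feature transform: bucketizing a numeric
# column into ordinal bucket indices. Boundaries are hypothetical.
import bisect

splits = [0, 18, 35, 60]  # hypothetical age bucket boundaries

def bucketize(value, splits):
    """Return the index of the bucket the value falls into."""
    return bisect.bisect_right(splits, value) - 1

ages = [15, 28, 42, 70]
print([bucketize(a, splits) for a in ages])  # [0, 1, 2, 3]
```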
DNNs have been successfully applied to computer vision, speech recognition, and natural language processing in recent years, and more and more scientists and engineers are adopting them with good results. A DNN can learn features automatically via embeddings; the most famous embedding trick is word2vec, which produces a vector space in which each unique word in the corpus is assigned a corresponding vector.
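To illustrate what an embedding buys you, here is a tiny sketch: each word maps to a dense vector, and related words end up with nearby vectors. The vectors below are invented for illustration; real word2vec learns them from a corpus:

```python
# Sketch of embedding lookup plus cosine similarity. The vectors are
# made-up toy values, not trained word2vec output.
import math

embedding = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.12],
    "apple": [0.10, 0.05, 0.90],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# In a good embedding, "king" is closer to "queen" than to "apple".
print(cosine(embedding["king"], embedding["queen"]) >
      cosine(embedding["king"], embedding["apple"]))  # True
```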
Model training is the most important step of the whole pipeline.
Deep learning is becoming more and more powerful, but it can’t solve all of humanity’s problems. In natural language processing, computer vision, and speech or video recognition, deep learning may perform better than traditional models. But for problems like recommendation or CTR estimation, highly scalable linear models still play a major role. And for graph-related models like topic models or PageRank, we still need a graph computation engine.
Furthermore, hybrid models are becoming more and more useful. For example, Facebook presented a hybrid model structure, the concatenation of boosted decision trees and a probabilistic sparse linear classifier, illustrated in the figure. Their experience shows that this hybrid structure significantly increases prediction accuracy.
Google also developed a hybrid model, the wide and deep learning model, which jointly trains a wide linear model (for memorization) alongside a deep neural network (for generalization), combining the strengths of both.
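The wide-and-deep idea can be sketched in a few lines: the final prediction applies a sigmoid to the sum of a wide linear part over sparse (crossed) features and a small deep part over dense features. All weights below are made-up constants, not trained values:

```python
# Sketch of the wide-and-deep combination. Feature names and weights
# are illustrative constants, not a trained model.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def wide_logit(sparse_features, weights):
    # Wide part: linear model over sparse (possibly crossed) features.
    return sum(weights.get(f, 0.0) for f in sparse_features)

def deep_logit(dense_features, w_hidden, w_out):
    # Deep part: one hidden ReLU layer over dense features.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, dense_features)))
              for row in w_hidden]
    return sum(w * h for w, h in zip(w_out, hidden))

sparse = ["user_installed_app=netflix", "impression_app=pandora"]
dense = [0.5, -1.2]
wide_w = {"user_installed_app=netflix": 0.3, "impression_app=pandora": -0.1}
w_hidden = [[0.2, 0.4], [-0.3, 0.1]]
w_out = [0.5, 0.7]

# Jointly, the two parts contribute one summed logit.
p = sigmoid(wide_logit(sparse, wide_w) + deep_logit(dense, w_hidden, w_out))
print(0.0 < p < 1.0)  # True: a click probability
```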
From the above cases, we can see that a machine learning platform should support both traditional machine learning models and deep learning models; both are very useful.
Deploy the model in a distributed fashion for parallel model serving in batch or streaming mode.
Evaluate the model offline or online with different metrics.
Predicting ad CTR is a massive-scale machine learning problem that is central to the multi-billion-dollar online advertising industry.
A typical CTR prediction problem shares similarities with many other industrial machine learning problems, which makes it very representative.
Usually there are billions of ad impressions daily, each with a unique id. We join impressions with the click stream every x minutes to produce the dataset for machine learning.
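The labeling step can be sketched as follows: an impression gets label 1 if a click with the same impression id arrives within the join window, else 0. Ids and feature fields below are illustrative:

```python
# Sketch of producing labeled CTR training data by joining
# impressions with the click stream for one window. Ids and the
# "ad" feature are made up for illustration.

impressions = {"imp-1": {"ad": "a9"}, "imp-2": {"ad": "a7"}}
clicks = {"imp-1"}  # impression ids seen in this window's click stream

dataset = [
    (imp_id, features, 1 if imp_id in clicks else 0)
    for imp_id, features in impressions.items()
]
print(dataset)  # [('imp-1', {'ad': 'a9'}, 1), ('imp-2', {'ad': 'a7'}, 0)]
```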
Each model has advantages and disadvantages:
Non-linear models, on the other hand, are able to utilize different feature combinations and thus could potentially improve estimation performance, but they can’t scale to a large number of parameters.
Deep neural networks (DNNs) are able to extract hidden structures and intrinsic patterns at different levels of abstraction from training data. But training a deep neural network on a large input feature space requires tuning a huge number of parameters, which is computationally expensive. Moreover, the raw inputs are high-dimensional, sparse binary features converted from the raw categorical features, which makes it hard to train traditional DNNs at large scale.