TensorFlow™ is a popular open source software library for machine intelligence. While TF lets people express the latest machine learning and deep learning algorithms, it is just as important to make TF fit well into the Hadoop ecosystem. In this session, we will talk about how Hadoop ecosystem components boost TF and other machine learning technologies, including:
1) Using Hadoop YARN to manage large-scale TF services running on a GPU-equipped cluster, sharing the same cluster with other tenants and applications.
2) Using Spark/Hive for large scale data preprocessing.
3) Using Zeppelin as an interactive interface to orchestrate and visualize the learning workflow.
Finally, we will use a classic machine learning challenge, online ads Click-Through Rate (CTR) prediction, as an example to show how TF works with YARN, Spark and Zeppelin to train a better model in an efficient way.
Data is flooding into every business. In many applications, more training data and bigger models mean better results. We use Hadoop to store large amounts of data, use Spark on YARN for simple data processing, and can also try machine learning frameworks such as TensorFlow or XGBoost on the Hadoop-based big data platform for machine learning or deep learning.
Another important change is the set of roles in machine learning. With growing datasets and increasingly complex problems, one person can't do all of the work; data scientists need to work together with software engineers. Data scientists usually explore the data and find the best machine learning pipeline. After that, software engineers deploy the model and make predictions based on new input. The input data could be batch data or streaming data.
This is a typical machine learning workflow, which involves three steps: feature engineering, model training and online serving. Not surprisingly, the most important thing is to have the right features: those capturing historical information dominate other types of features. Once we have the right features and the right model, other factors play small roles.
We first derive a feature representation from the raw data, then feed these features into a machine learning model, and finally evaluate the candidates and push the best model into the online service.
The machine learning workflow is complicated; it usually involves several steps with the help of several infrastructure components.
As the workflow shows, only a tiny fraction of the code is actually devoted to model learning. The machine learning workflow usually needs a lot of support from the big data platform, such as data collection from different data sources, feature extraction, feature transformation, and so on.
Let’s find out how big data infrastructure could help machine learning step by step.
The machine learning workflow starts with loading data from different data sources, such as HDFS, AWS S3 or a database system.
After that, we usually join data from different sources to generate one wide table. Apache Hive and Apache Spark are the most appropriate tools for this workload.
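To make this concrete, here is a minimal PySpark sketch of loading from several sources and joining them into one wide table; the paths, table names and join keys are illustrative assumptions, not part of the original pipeline.

```python
# A minimal PySpark sketch: load from HDFS, S3 and Hive, then join into
# one wide table. Paths, table names and join keys are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wide-table").getOrCreate()

events = spark.read.parquet("hdfs:///data/events")   # event logs on HDFS
users = spark.read.json("s3a://bucket/users")        # user profiles on S3
ads = spark.table("ads_metadata")                    # a Hive table

wide = (events.join(users, on="user_id", how="left")
              .join(ads, on="ad_id", how="left"))
wide.write.mode("overwrite").saveAsTable("ml_wide_table")
```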
Then data scientists start data exploration via Zeppelin. The most common issue is an unbalanced label distribution in the dataset; for example, one class may have far more instances than the other. To get a more accurate model, we need to subsample the class that has more instances to make the dataset balanced.
After that, we randomly split the dataset into training and test sets with the help of Spark, as sketched below.
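A hedged PySpark sketch of both steps, downsampling the majority class and then splitting; the "label" column name, the assumed majority class and the 10% sampling rate are all illustrative.

```python
# Rebalance by downsampling the majority class, then split; assumes a
# binary "label" column where 0 is the majority class.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("ml_wide_table")  # the wide table built earlier

majority = df.filter(df.label == 0).sample(False, 0.1, seed=42)
minority = df.filter(df.label == 1)
balanced = majority.union(minority)

# Randomly split into training and test sets.
train, test = balanced.randomSplit([0.8, 0.2], seed=42)
```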
Once we get training data, we can start feature engineering.
Feature engineering technology has made great progress over the past decade, from hand-designed features to automated feature discovery with deep learning.
In many cases, hand-designed features can leverage domain knowledge and lead to excellent results; Spark MLlib provides many feature transformation and selection operators that make this simple and easy. But it involves heavy manual work and requires hiring experienced engineers.
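As an illustration, here is a small sketch built from MLlib's stock operators; the column names ("site_category", "hour", "banner_pos") are assumptions chosen for a CTR-like dataset.

```python
# Index a categorical column, one-hot encode it, and assemble a feature
# vector with MLlib's built-in transformers; column names are assumed.
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

indexer = StringIndexer(inputCol="site_category", outputCol="site_category_idx")
encoder = OneHotEncoder(inputCol="site_category_idx", outputCol="site_category_vec")
assembler = VectorAssembler(
    inputCols=["site_category_vec", "hour", "banner_pos"],
    outputCol="features")

pipeline = Pipeline(stages=[indexer, encoder, assembler])
pipeline_model = pipeline.fit(train)            # fit on the training split
train_features = pipeline_model.transform(train)
```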
DNNs have been successfully applied in computer vision, speech recognition and natural language processing in recent years, and more and more scientists and engineers are adopting them with good results. A DNN can learn features automatically via embeddings; the most famous embedding trick is word2vec, which produces a vector space in which each unique word in the corpus is assigned a corresponding vector.
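Spark MLlib ships a Word2Vec implementation, so a minimal sketch on a toy tokenized corpus looks like this (a real corpus would come from the text fields of the dataset above):

```python
# Train word2vec on a tiny tokenized corpus with Spark MLlib.
from pyspark.ml.feature import Word2Vec
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
corpus = spark.createDataFrame(
    [(["hadoop", "stores", "big", "data"],),
     (["tensorflow", "trains", "deep", "models"],)],
    ["text"])

word2vec = Word2Vec(vectorSize=50, minCount=1, inputCol="text", outputCol="vector")
model = word2vec.fit(corpus)
model.getVectors().show()  # one learned vector per unique word
```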
Model training is the most important step of the whole pipeline.
Deep learning is becoming more and more powerful, but it can't solve all of humanity's problems. In natural language processing, computer vision, and speech or video recognition, deep learning may perform better than traditional models. But for problems like recommendation or CTR estimation, highly scalable linear models still play a major role. And for graph-related models like topic models or PageRank, we still need a graph computation engine.
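For the CTR case, a scalable linear baseline might look like the following MLlib sketch; the "features" and "label" columns follow the feature pipeline sketched earlier, and the hyperparameters are placeholders, not tuned values.

```python
# A scalable linear baseline: MLlib logistic regression on the assembled
# feature vectors; hyperparameters are illustrative placeholders.
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol="features", labelCol="label",
                        maxIter=20, regParam=0.01)
lr_model = lr.fit(train_features)
```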
Furthermore, hybrid models are becoming more and more useful. For example, Facebook presented a hybrid model structure: the concatenation of boosted decision trees and a probabilistic sparse linear classifier, illustrated in the figure. Their experience tells us that this hybrid structure significantly increases prediction accuracy.
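A compact sketch of that idea on synthetic data, using scikit-learn for brevity; it illustrates the structure (trees map each sample to leaf indices, which are one-hot encoded and fed to a sparse linear classifier), not Facebook's actual implementation.

```python
# GBDT + sparse linear hybrid: tree leaf indices become the categorical
# features of a logistic regression, in the Facebook style.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=4, random_state=0)
gbdt.fit(X, y)

# apply() returns, per sample and per tree, the index of the leaf reached;
# those indices are one-hot encoded as input to the linear classifier.
leaves = gbdt.apply(X)[:, :, 0]
encoder = OneHotEncoder()
linear = LogisticRegression(max_iter=1000)
linear.fit(encoder.fit_transform(leaves), y)
```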
Google also developed a hybrid model, the wide and deep learning model, which jointly trains a wide linear model (for memorization) alongside a deep neural network (for generalization), combining the strengths of both.
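TensorFlow ships a canned estimator for exactly this structure; below is a minimal sketch in which the feature names, bucket sizes and embedding dimensions are illustrative assumptions.

```python
# Wide-and-deep with TensorFlow's built-in estimator; feature names and
# sizes are assumptions for illustration.
import tensorflow as tf

site = tf.feature_column.categorical_column_with_hash_bucket("site_id", 10000)
app = tf.feature_column.categorical_column_with_hash_bucket("app_id", 10000)

wide_columns = [site, app,
                tf.feature_column.crossed_column([site, app], 100000)]
deep_columns = [tf.feature_column.embedding_column(site, 16),
                tf.feature_column.embedding_column(app, 16)]

model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,  # wide part: memorization
    dnn_feature_columns=deep_columns,     # deep part: generalization
    dnn_hidden_units=[128, 64])
# model.train(input_fn=...) then runs the joint training.
```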
From the above cases we can see that a machine learning platform should support both traditional machine learning models and deep learning models; both are very useful.
Deploy the model in a distributed fashion for parallel model serving in batch or streaming mode.
Evaluate the model offline or online with different metrics.
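For example, offline evaluation of the linear model above could compute AUC on the held-out test split with MLlib; this sketch reuses the names introduced in the earlier snippets.

```python
# Offline evaluation: transform the held-out test split with the fitted
# feature pipeline, score it, and compute AUC.
from pyspark.ml.evaluation import BinaryClassificationEvaluator

test_features = pipeline_model.transform(test)
evaluator = BinaryClassificationEvaluator(labelCol="label",
                                          metricName="areaUnderROC")
auc = evaluator.evaluate(lr_model.transform(test_features))
print("test AUC = %.4f" % auc)
```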
Predicting Ad CTR is a massive-scale machine learning problem that is central to the multi-billion dollar online advertising industry.
A typical CTR prediction problem shares similarities with many other industrial machine learning problems, which makes it very representative.
Usually there are billions of ad impressions daily. Each impression has a unique ID. We need to join impressions with the click stream every x minutes to build the dataset for machine learning.
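A minimal PySpark sketch of that labeling join; the table and column names (impressions, clicks, impression_id) are assumptions for illustration.

```python
# Label impressions by left-joining the click stream: an impression that
# matched a click within the window gets label 1, the rest get label 0.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
impressions = spark.table("impressions")  # one row per ad impression
clicks = spark.table("clicks")            # one row per click event

labeled = (impressions
           .join(clicks.select("impression_id").withColumn("clicked", F.lit(1)),
                 on="impression_id", how="left")
           .fillna({"clicked": 0}))
```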
Each model has advantages and disadvantages:
Linear models are simple to train and scale to very large feature spaces, but on their own they cannot capture non-linear feature combinations.
Non-linear models, on the other hand, are able to utilize different feature combinations and thus could potentially improve estimation performance, but can't scale to a large number of parameters.
Deep neural networks (DNNs) are able to extract hidden structures and intrinsic patterns at different levels of abstraction from training data. But training deep neural networks on a large input feature space requires tuning a huge number of parameters, which is computationally expensive. Moreover, the input raw features are high-dimensional, sparse binary features converted from the raw categorical features, which makes it hard to train traditional DNNs at large scale.
The features for CTR prediction are drawn from a variety of sources, including the query, the text of the ad creative, and various ad- or user-related metadata. We then feed the data into the complex pipeline and push the model to the online service.
Data scientist A clicks the export button and saves the notebook to a file system or cloud storage; data scientist B can then load this notebook from another web browser. Data scientist B can easily re-run the notebook and help tune the parameters.