This is a typical machine learning workflow, which involves three steps: feature engineering, model training, and online serving. Not surprisingly, the most important thing is to have the right features: those capturing historical information dominate other types of features. Once we have the right features and the right model, the other factors play only small roles.
We first derive feature representations from raw data, then feed these features into a machine learning model, and finally evaluate the candidate models and push the best one into the online service.
The machine learning workflow is complicated, usually involving several steps with the help of several infrastructure components.
The machine learning workflow starts with loading data from different data sources, such as HDFS, AWS S3, or a database system.
After that, we usually join data from different sources to generate a wide table. Apache Hive and Apache Spark are the most appropriate tools for this workload.
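At scale this join would be a Hive or Spark SQL query; the pure-Python sketch below only illustrates the idea of merging two sources into one wide row per key. The table names, columns, and `join_on_key` helper are illustrative, not from the original workflow.

```python
# Sketch: left-join two data sources on a shared key to produce a wide table.
# In production this would be a Hive/Spark SQL join over large datasets.

def join_on_key(left_rows, right_rows, key):
    """Left-join two lists of dicts on `key`, producing wide rows."""
    right_index = {row[key]: row for row in right_rows}
    wide = []
    for row in left_rows:
        merged = dict(row)                      # start from the left row
        merged.update(right_index.get(row[key], {}))  # add right columns if present
        wide.append(merged)
    return wide

users = [{"user_id": 1, "age": 30}, {"user_id": 2, "age": 25}]
clicks = [{"user_id": 1, "click_count": 7}]

wide_table = join_on_key(users, clicks, "user_id")
```

Rows without a match on the right side simply keep only their left-side columns, mirroring a SQL left join.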
Then, data scientists start data exploration via Zeppelin. The most common issue is an unbalanced label distribution in the dataset; for example, positive labels may far outnumber negative ones. To get a more accurate model, we subsample the group with more instances to balance the dataset.
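The downsampling idea can be sketched in a few lines: randomly sample the majority class down to the size of the minority class. This is a minimal pure-Python illustration; the `label` key and the function name are assumptions for the example.

```python
import random

def downsample_majority(rows, label_key="label"):
    """Randomly subsample the larger label group so both groups are equal size."""
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    if len(positives) > len(negatives):
        majority, minority = positives, negatives
    else:
        majority, minority = negatives, positives
    sampled = random.sample(majority, len(minority))  # draw without replacement
    return minority + sampled

random.seed(42)  # fixed seed so the example is reproducible
data = [{"label": 1} for _ in range(90)] + [{"label": 0} for _ in range(10)]
balanced = downsample_majority(data)
```

After balancing, `balanced` contains 10 positives and 10 negatives instead of the original 90/10 skew.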
After that, we randomly split the dataset into training and test sets with the help of Spark.
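In Spark this is a one-liner (`DataFrame.randomSplit`); the pure-Python sketch below shows the same shuffle-and-cut logic for a small in-memory dataset. The fraction and seed values are illustrative.

```python
import random

def random_split(rows, train_fraction=0.8, seed=7):
    """Shuffle the rows and cut them into train/test partitions."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the input is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = random_split(list(range(100)))
```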
Once we have the training data, we can start feature engineering.
Feature engineering technology has made great progress over the past decade, from hand-designed features to automated feature discovery by deep learning.
In many cases, hand-designed features can leverage domain knowledge and lead to near-optimal results; Spark MLlib provides many feature transformation and selection operators to make this simple and easy. But hand-designing features involves heavy manual work and requires hiring experienced engineers.
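Two of the most common hand-designed transforms are standard scaling of numeric columns and one-hot encoding of categorical columns (MLlib exposes these as `StandardScaler` and `OneHotEncoder`). A minimal pure-Python sketch of both, with illustrative input values:

```python
def standard_scale(values):
    """Center a numeric column to mean 0 and scale to unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5 or 1.0  # guard against a constant column
    return [(v - mean) / std for v in values]

def one_hot(categories):
    """Encode a categorical column as one-hot vectors over the sorted vocabulary."""
    vocab = sorted(set(categories))
    index = {c: i for i, c in enumerate(vocab)}
    return [[1.0 if index[c] == i else 0.0 for i in range(len(vocab))]
            for c in categories]

scaled = standard_scale([1.0, 2.0, 3.0])
encoded = one_hot(["ios", "android", "ios"])
```

In a real pipeline these transforms are fit on the training split only, then applied unchanged to the test split to avoid leakage.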
DNNs have been successfully applied in computer vision, speech recognition, and natural language processing in recent years, and more and more scientists and engineers are adopting them with good results. A DNN can learn features automatically via embeddings; the most famous embedding technique is word2vec, which produces a vector space in which each unique word in the corpus is assigned a corresponding vector.
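The useful property of such a vector space is that semantically similar words end up close together, which is typically measured with cosine similarity. The sketch below uses tiny hand-written vectors as stand-ins for learned word2vec embeddings; a real model (e.g. MLlib's `Word2Vec`) would learn hundreds of dimensions from a large corpus.

```python
import math

# Toy hand-written vectors standing in for learned word2vec embeddings.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

sim_royal = cosine(embeddings["king"], embeddings["queen"])
sim_fruit = cosine(embeddings["king"], embeddings["apple"])
```

With real embeddings, related words like "king" and "queen" score much higher than unrelated pairs, exactly as in this toy setup.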
Model training is the most important step of the whole pipeline.
Deploy the model in a distributed fashion for parallel model serving in batch or streaming mode.
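The core of parallel batch serving is partitioning the input and scoring each partition independently. The sketch below uses a thread pool and a hypothetical linear model (`WEIGHTS` and `score` are stand-ins, not part of the original workflow); a production system would run the same pattern across Spark executors.

```python
from concurrent.futures import ThreadPoolExecutor

WEIGHTS = [0.5, 0.25]  # hypothetical trained linear-model weights

def score(row):
    """Score one feature row with the (stand-in) linear model."""
    return sum(w * x for w, x in zip(WEIGHTS, row))

def batch_score(rows, workers=4):
    """Partition the rows and score each partition in parallel."""
    chunks = [rows[i::workers] for i in range(workers)]  # round-robin partitions
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scored_chunks = pool.map(lambda chunk: [score(r) for r in chunk], chunks)
    return [s for chunk in scored_chunks for s in chunk]

scores = batch_score([[1.0, 2.0], [2.0, 4.0]])
```

Streaming mode follows the same shape, except partitions arrive continuously as micro-batches instead of being cut from a fixed dataset.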
Evaluate the model offline or online using different metrics.
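For offline evaluation of a binary classifier, precision and recall are the standard starting point, especially with the unbalanced labels mentioned earlier. A minimal sketch (the labels below are illustrative):

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (offline evaluation)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of true positives, how many were found
    return precision, recall

p, r = precision_recall([1, 0, 1, 1], [1, 1, 0, 1])
```

Online evaluation instead tracks business metrics (e.g. click-through rate) on live traffic, typically via A/B testing.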